SEMANTIC RETRIEVAL WITH ENHANCED MATCHMAKING AND MULTI-MODALITY ONTOLOGY

Huan Wang, Liang-Tien Chia and Song Liu
Nanyang Technological University, Singapore 639798
{wa0004an, asltchia, pg03988006}@ntu.edu.sg

ABSTRACT

This paper describes a semantic retrieval system that performs matchmaking with ranked output and uses a multi-modality ontology to retrieve animal images. Our multi-modality ontology, which integrates image features and text information, is extended with a ranking mechanism: a ranking is calculated from the correlation within each modality and is used to refine the semantic matchmaking result. To benchmark our results, we use the top 200 images of Google Image Search for each category in the experimental comparison. Google Image Search claims to be the most comprehensive on the Web, with billions of images indexed and available for viewing. For the different categories of animals in the canine family, we found that on average only about 60% of the top 200 images are correct, and Google returns even more false results beyond this range, so a larger image set would add little to the experiment. A medium-sized data set is sufficient because we are testing retrieval performance on web images and are mainly concerned with the precision of the top retrievals. We believe the canine domain is challenging, as demonstrated by the visual variance of the objects and backgrounds. Twenty animal categories, containing animal images and their corresponding web pages, are collected to form a systematic animal family. The results show that we can classify perceptually close animal species that share similar appearances, since we can infer their hidden relationships from the canine family graph. By assigning rankings to the semantic relationships, we provide clear evidence that the improved model achieves good accuracy.

Keywords: Web Image Retrieval, Multi-Modality Ontology, Semantic Matchmaking, Ranking Correlation

1. INTRODUCTION

Automatic image annotation and retrieval are difficult tasks, especially in domains with complex objects and backgrounds. Much ongoing research therefore focuses on bridging the semantic gap through different approaches. Besides attempts to extract high-level concepts from image features alone, more work now tries to derive semantic content from both image features and the surrounding text. S. Radhouani et al. [1] combined textual and visual ontologies for medical image retrieval. T. L. Berg et al. [2] used a linear, equally weighted combination of four independent cue scores drawn from both visual and textual cues. However, ontology-based approaches are still largely confined to domains such as medical image retrieval, where the emphasis remains very much on textual information.

Other attempts in more complex domains use a rather straightforward combination, considering neither the semantic relationships between categories nor the priority of different kinds of information. We build a multi-modality ontology for the animal image annotation and retrieval task. The important problems are how to construct an ontology that integrates different modalities, how to quantify the matching degree, and how well the approach scales to large domains. The first problem is addressed and elaborated in [3], where we re-ranked Google results, taking Google's top 200 returned images for each animal category as our initial data set. In this paper, we propose our own ranking mechanism, which uses Spearman's rank correlation to measure the similarity of concepts in the ontology. In both papers we use the RACER [4] semantic matchmaker on our multi-modality ontology, and we show that the method is scalable in both the construction and the execution phase.

2. PREVIOUS WORK

In this section we briefly describe our previous work on ontology construction. Three sub-ontologies constitute our multi-modality ontology. The first is the Animal Domain Ontology, a hierarchical tree structure based on the formal definition of the animal taxonomy. We derive the taxonomy from WordNet [5], an online lexical database providing general definitions in various domains; this taxonomy serves as the expert knowledge for constructing the knowledge base. In the tree structure, the hyponym relationship between two concepts is refined into a subclass property in the ontology; for example, the hyponym relationship between fox and canine becomes a subclass relationship. The Animal Domain Ontology enables retrieval not only of a single animal category but also of related animal categories. Taking image retrieval for wild dog as an example: even though a dhole image may have no text description containing the words wild or dog, it will still be retrieved correctly, because dhole is defined as a subclass of wild dog.
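As a concrete illustration of this taxonomy derivation (not part of the original system), the following Python sketch walks the WordNet hypernym chain for a species name and turns each hyponym/hypernym link into a candidate subclass edge; stopping at a fixed root and using only the first noun sense are simplifying assumptions.

```python
# Illustrative sketch only: derive subclass edges from WordNet hypernym links.
# Requires the NLTK WordNet corpus (nltk.download('wordnet')).
from nltk.corpus import wordnet as wn

def taxonomy_edges(term, root="canine"):
    """Collect (subclass, superclass) pairs along the hypernym chain of `term`,
    stopping once a synset whose name contains `root` is reached."""
    edges = []
    synset = wn.synsets(term, pos=wn.NOUN)[0]   # simplification: first noun sense
    while True:
        hypernyms = synset.hypernyms()
        if not hypernyms:
            break
        parent = hypernyms[0]
        edges.append((synset.name(), parent.name()))
        if root in parent.name():
            break
        synset = parent
    return edges

# e.g. dhole -> wild_dog -> canine, each pair becoming a subclass axiom
print(taxonomy_edges("dhole"))
```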


The second sub-ontology is the Textual Description Ontology, which encapsulates narrative animal descriptions. For this part, the domain knowledge is collected from the BBC Science & Nature Animal Category (http://www.bbc.co.uk/nature/wildfacts/animals_a_z.shtml), which provides standard descriptions for around 620 animals. We extract information such as distribution and habitat and define semantic relationships such as hasDistribution and hasHabitat to construct the ontology. This ontology is applied to the text information and can retrieve the target even if the surrounding text contains no exactly matching keyword. For example, if the surrounding text only mentions a kind of fox living in the Arctic, the image is automatically annotated as Arctic fox, since it is the only fox species living in that area. The last sub-ontology is the Visual Description Ontology, which is constructed from high-level concepts derived from low-level image features. We use supervised learning over image features such as the color correlogram, the MPEG-7 color structure descriptor and the edge histogram descriptor, and these features are connected to the animal concepts through semantic relationships. With this multi-modality ontology, image features and text information are organized in a semantically sound structure and become a much richer information set than a linear combination of keywords.
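To make the structure of the combined knowledge base concrete, here is a minimal sketch using rdflib; the class and property names (ArcticFox, hasHabitat, hasImFurColor) and the example namespace are illustrative stand-ins rather than the authors' exact schema.

```python
# Minimal sketch of the three sub-ontologies tied together in one RDF graph.
from rdflib import Graph, Namespace, Literal, RDFS

EX = Namespace("http://example.org/animal#")   # hypothetical namespace
g = Graph()
g.bind("ex", EX)

# Animal Domain Ontology: taxonomy expressed as subclass links
g.add((EX.ArcticFox, RDFS.subClassOf, EX.Fox))
g.add((EX.Fox, RDFS.subClassOf, EX.Canine))

# Textual Description Ontology: narrative facts from the BBC descriptions
g.add((EX.ArcticFox, EX.hasHabitat, Literal("tundra")))
g.add((EX.ArcticFox, EX.hasDistribution, Literal("Arctic")))

# Visual Description Ontology: high-level concept from low-level features
g.add((EX.ArcticFox, EX.hasImFurColor, Literal("white")))

# Query the combined graph, e.g. every concept whose distribution is Arctic
for concept in g.subjects(EX.hasDistribution, Literal("Arctic")):
    print(concept)
```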

3. DATA SET

In our experiment we have collected 4000 images for 20 canine categories. This domain structure is still being extended; another important purpose of the structure is to build an animal database that can be reused as a reference in our future research. In this domain we first classify the animals by their main class: there are five main subclasses under the canine class, namely wolf, fox, wild dog, jackal and hyena. A further classification at the second level follows the domain knowledge provided by WordNet, the principle being to classify all the animals into different categories without overlap or cross-reference. We use Google Image Search to set up our data set and stop at an initial data set containing more than 4000 images across the 20 animal categories. For each image, the corresponding web page is also downloaded for text analysis. Web pages usually contain sparse information, and in many cases most of it has no explicit relationship with the image on that page. Retrieving exactly the information that helps classify and annotate the image is challenging, especially because web information contains a lot of noise. Some works [2, 6] use probabilistic methods, a popular approach, to process the text information, but there is no guarantee that only information with a clear semantic relationship is extracted. In our experiment we first use WordNet to remove the redundant information; we then reuse the semantic relationships defined in the knowledge base and apply the RACER [4] semantic matchmaker to extract the concepts and relationships for further reasoning.
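The WordNet-based pruning step could be sketched roughly as follows; the choice of domain roots and the purely token-level filtering are assumptions made for illustration, and the subsequent RACER reasoning is not reproduced here.

```python
# Hedged sketch: keep only tokens whose WordNet noun senses fall under a
# domain root, as a crude stand-in for the redundancy-removal step.
from nltk.corpus import wordnet as wn

def is_domain_relevant(word, domain_roots=("animal.n.01", "habitat.n.01")):
    """Keep a word only if some noun sense is a descendant of a domain root."""
    roots = {wn.synset(r) for r in domain_roots}
    for synset in wn.synsets(word, pos=wn.NOUN):
        closure = set(synset.closure(lambda s: s.hypernyms()))
        if roots & closure:
            return True
    return False

surrounding_text = "the arctic fox lives on the tundra near the coast"
kept = [w for w in surrounding_text.split() if is_domain_relevant(w)]
print(kept)   # 'fox' is kept; other tokens survive only if WordNet links them to a root
```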

4. RANKING CORRELATION WITH MULTI-MODALITY ONTOLOGY

In this section we discuss our multi-modality ontology model with ranking correlation. Most semantic matchmaking approaches provide only three results: Exact Match, Subsume Match and Disjoint Match. Exact Match is considered the best result, since the input query concept is exactly the same as a predefined concept. Subsume Match is the next preferred result: the input query concept subsumes several predefined concepts, meaning it could be annotated as any one of them. Disjoint Match comes last, because the input query concept does not belong to any predefined concept. In our previous model, the retrieval results were ranked according to these criteria, with a further ranking given by the degree of Subsume Match: the more concepts an input query concept subsumes, the lower its ranking in the final result, since the more concepts a query subsumes, the smaller the probability that it belongs to any particular one of them. In most cases this ranking mechanism provides satisfactory results. However, we would like a global method to quantify the similarity between concepts that receive the same semantic matchmaking result. To solve this problem, we exploit a source of information that was neglected in our previous model: the ranking of the semantic relationships within a concept. An ontology is composed of relationships between concepts, and in our case these come from different modalities. Our knowledge base is built from predefined concepts: the animal concepts, which are matched against and annotated to the web images, and general concepts such as color and distribution. The general concepts are defined in order to construct the animal concepts, and the two kinds of concepts are connected by semantic relationships, which are also defined in the knowledge base. During image analysis, a generated concept is automatically constructed for each image from the information extracted by text analysis and image feature analysis. This generated concept, also known as an anonymous concept, is reasoned against the animal concepts in the knowledge base to check whether it matches any predefined animal concept. Unlike [2], we argue that semantic relationships should not have equal weight, in line with human knowledge: taking the animal concepts in our experiment as an example, the hasName relationship should have higher priority than hasColor, as different animals may share the same color but not the same name. Our aim is to build a similarity measure based on the different priorities of the semantic relationships. Inspired by the different relationship frequencies extracted from web pages, we propose a mechanism whose general idea is that the frequency of a particular word in a web page indicates its priority, and thus its degree of relevance to the page subject.
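The earlier ranking criterion described above can be summarized in a few lines; the dictionaries standing in for matchmaker output are purely illustrative, since the real match results come from RACER.

```python
# Sketch of the previous ranking criterion: Exact > Subsume > Disjoint, and
# within Subsume matches prefer queries that subsume fewer predefined concepts.
MATCH_ORDER = {"exact": 0, "subsume": 1, "disjoint": 2}

def rank_matches(matches):
    """matches: list of dicts like {"image": ..., "match": "subsume", "n_subsumed": 3}."""
    return sorted(matches,
                  key=lambda m: (MATCH_ORDER[m["match"]], m.get("n_subsumed", 0)))

candidates = [
    {"image": "img_a.jpg", "match": "subsume", "n_subsumed": 4},
    {"image": "img_b.jpg", "match": "exact",   "n_subsumed": 0},
    {"image": "img_c.jpg", "match": "subsume", "n_subsumed": 1},
]
for m in rank_matches(candidates):
    print(m["image"], m["match"])
# img_b (exact) first, then img_c (subsumes fewer concepts), then img_a
```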



Figure 1 (three panels: Aardwolf, Coyote and Golden Jackal retrieval results): each panel plots the number of correct images retrieved against the number of images retrieved in ranking order (up to 200), with curves for Google, M-M Ontology, M-M Ontology with ranking correlation, and Optimal.

Fig. 1. A comparison of image retrieval results between different approaches (1)

We first assign rankings to the semantic relationships in the ontology according to domain knowledge. After processing the text information in a web page and extracting the semantic relationships, we record the frequency of each semantic relationship and calculate its ranking from that frequency: the higher the frequency, the higher the final ranking. The underlying assumption is that if a certain semantic relationship appears more frequently than others in a web page, the information it carries is more relevant to the image subject and should have higher priority. When assigning the predefined rankings, we treat the semantic relationships from each modality differently: to ensure that relationships derived from low-level image features affect the final degree of similarity significantly more than the text-information relationships, we assign a high value whenever a positive image-feature relationship is discovered.
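A rough sketch of this frequency-to-rank step is given below; the relationship names, the sentinel count used to force image-feature relations to the top rank, and the tie handling are all assumptions for illustration.

```python
# Convert per-page relationship frequencies into ranks (1 = highest priority),
# forcing a positive image-feature relationship to the first rank.
IMAGE_FEATURE_RELATIONS = {"hasImFurColor"}
LARGE_COUNT = 10**6   # "large enough" sentinel value

def relation_ranks(counts, positive_image_feature=True):
    """counts: dict relation -> frequency extracted from one web page."""
    adjusted = dict(counts)
    for rel in IMAGE_FEATURE_RELATIONS:
        adjusted[rel] = LARGE_COUNT if positive_image_feature else 0
    ordered = sorted(adjusted, key=adjusted.get, reverse=True)
    return {rel: i + 1 for i, rel in enumerate(ordered)}

# First web page of the example below: hasName appears 7 times, hasColor 5 times
print(relation_ranks({"hasImFurColor": 0, "hasName": 7, "hasColor": 5}))
# -> {'hasImFurColor': 1, 'hasName': 2, 'hasColor': 3}
```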

The following is a simplified example. Suppose we are dealing with three relationships: hasImFurColor, which comes from the image-feature modality and describes the texture and color of the object in the image, and hasName and hasColor, which both come from the text-information modality. Relationships describing low-level features are obtained from the image classification process; although they have no frequency of occurrence in the way word information does, we pre-assign a value for their frequency count. The expected ranking of hasImFurColor, hasName and hasColor is defined in a ranking vector V as [1,2,3]. Now consider two images, with their corresponding web pages, that have the same semantic matchmaking result. In the first web page the numbers of extracted hasImFurColor, hasName and hasColor relationships are x1, x2 and x3, and in the second web page they are y1, y2 and y3 respectively. In this case we find that x2=7, x3=5, y2=3 and y3=6, and for both images a positive hasImFurColor relationship is discovered, so x1 and y1 are assigned a value large enough to guarantee the first rank. Since x2>x3, hasName takes the second rank and hasColor the third rank in the first web page, and the ranking vector for the first image is therefore [1,2,3].

Similarly, in the second web page hasName takes the third rank and hasColor the second rank, so the ranking vector for the second image is [1,3,2]. We combine Spearman's rank correlation with our multi-modality ontology model to refine the result of semantic matchmaking. Spearman's rank correlation measures how well the relationship between two variables can be described by a monotonic function, regardless of their frequency distribution. The rank correlation coefficient is denoted by $\rho$ and given by

$$\rho = 1 - \frac{6 \sum_i d_i^2}{n(n^2 - 1)} \qquad (1)$$

where $d_i$ is the difference between the two rankings of the $i$-th relationship and $n$ is the number of relationships. In our case the variables are vectors representing the semantic relationships at the ordinal level, which makes Spearman's rank correlation applicable. The values of $\rho$ for the two images in our example are 1 and 0.5. This result is expected: since hasName is a more important relationship than hasColor, the expected ranking of hasName is higher than that of hasColor, so even though the matchmaking results are the same for the two images, the first web page yields a higher similarity. In the above case, if a false hasImFurColor relationship were discovered for both images, we would assign 0 to x1 and y1, the ranking of hasImFurColor would drop, and the final result would be affected more strongly by the relationships from the image-feature modality, as intended.
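The worked example can be checked numerically; the short function below follows equation (1) directly, and SciPy's spearmanr (used here only as an independent check, not part of the original system) gives the same values.

```python
# Verify the example: expected ranking vs. the two per-page ranking vectors.
from scipy.stats import spearmanr

expected = [1, 2, 3]   # hasImFurColor, hasName, hasColor
image1   = [1, 2, 3]   # ranks from the first web page
image2   = [1, 3, 2]   # ranks from the second web page

def rho(ranks_a, ranks_b):
    """Spearman's rank correlation, equation (1), assuming no tied ranks."""
    n = len(ranks_a)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

print(rho(expected, image1), rho(expected, image2))   # 1.0 0.5
corr1, _ = spearmanr(expected, image1)
corr2, _ = spearmanr(expected, image2)
print(corr1, corr2)                                   # 1.0 0.5
```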

5. EXPERIMENTAL RESULTS

Our experiment currently involves 20 types of canines to evaluate the multi-modality ontology's performance. The test set consists of the top 200 Google Image results for each canine class. In this section we show the experimental results for the multi-modality ontology with ranking correlation and compare our retrieval results with both Google Image Search and the result set obtained when our multi-modality ontology without ranking correlation is used to re-rank Google's results [3]. The results of our animal ontology-based image retrieval are presented in Figure 1, where M-M Ontology is short for Multi-Modality Ontology. We use the Google Image results as the baseline and compare them with our approach.
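For reference, the quantity plotted in Figures 1 and 2 (the number of correct images among the top k of a ranked list) can be computed as below; the image identifiers and ground-truth set are hypothetical, since the actual relevance judgments are not reproduced here.

```python
# Count correct images among the first k retrieved, for a ranked result list.
def correct_at_k(ranked_ids, relevant_ids, k):
    """Number of images in the top k that appear in the ground-truth set."""
    return sum(1 for img in ranked_ids[:k] if img in relevant_ids)

# Hypothetical ranked output and ground truth for one category
ranked   = ["img3", "img7", "img1", "img9", "img4"]
relevant = {"img1", "img3", "img4"}
print([correct_at_k(ranked, relevant, k) for k in (1, 3, 5)])   # [1, 2, 3]
```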


Figure 2 (four panels: Top 20, Top 40, Top 60 and Top 80 Image Retrieval Result): each panel plots the number of correct images retrieved against the 20 animal subspecies, with series for the Google result, the M-M Ontology result (re-ranking with Google's initial ranking), and the M-M Ontology with ranking correlation result (without Google's ranking).

Fig. 2. A comparison of image retrieval results between different approaches (2)

Due to space limitations, we present only three categories of test results: Aardwolf, Coyote and Golden Jackal. From Figure 1 we can see that the proposed multi-modality ontology with ranking correlation outperforms our previous approach, in which the multi-modality ontology was used to re-rank the Google Image Search results. In the remaining test categories, most groups give results comparable to our previous model. Since we are testing retrieval performance on web images, we are concerned primarily with the precision of the top retrievals; even for Google Image Search, the average number of correct images over all categories is 106 out of the top 200. Figure 2 shows the precision of the top 20, top 40, top 60 and top 80 returned results for the 20 animal categories, which from 1 to 20 are Aardwolf, African wild dog, bat-eared fox, black-backed jackal, cape fox, Arctic fox, grey fox, red fox, kit fox, bush dog, coyote, dhole, dingo, Ethiopian wolf, fennec fox, golden jackal, grey wolf, maned wolf, red wolf and spotted hyena. Our new approach gives consistently better results than the Google Image Search baseline, and we are able to provide our own ranking without needing Google's initial ranking information.

6. CONCLUSION

We have developed a new ranking mechanism for our multi-modality ontology-based image annotation and retrieval using rank correlation. By further analysing the priority of the semantic relationships that constitute the ontology concepts, we have defined a ranking for these relationships and are able to calculate a degree of similarity between the generated concepts and the predefined concepts. We have shown very promising results in our experiment and have furthermore demonstrated that our multi-modality ontology is scalable. So far the ranking criteria involve only the canine domain, and a larger animal domain is still being constructed. In the future we will seek ways to improve the ranking mechanism.





7. REFERENCES

[1] Said Radhouani, Joo Hwee Lim, Jean-Pierre Chevallet, and Gilles Falquet, "Combining textual and visual ontologies to solve medical multimodal queries," in 2006 IEEE International Conference on Multimedia & Expo, August 2006.

[2] Tamara L. Berg and David A. Forsyth, "Animals on the web," in CVPR (2), IEEE Computer Society, 2006, pp. 1463-1470.

[3] Huan Wang, Song Liu, and Liang-Tien Chia, "Does ontology help in image retrieval? A comparison between keyword, text ontology and multi-modality ontology approaches," in MULTIMEDIA '06: Proceedings of the 14th Annual ACM International Conference on Multimedia, New York, NY, USA, ACM Press, 2006, pp. 109-112.

[4] V. Haarslev and R. Möller, "RACER system description," in Proceedings of the International Joint Conference on Automated Reasoning, 2001, pp. 701-705.

[5] G. A. Miller, R. Beckwith, C. D. Fellbaum, D. Gross, and K. J. Miller, "WordNet: An on-line lexical database," International Journal of Lexicography, vol. 3, pp. 235-244, 1990.

[6] Keiji Yanai and Kobus Barnard, "Probabilistic web image gathering," in MIR '05: Proceedings of the 7th ACM SIGMM International Workshop on Multimedia Information Retrieval, New York, NY, USA, ACM Press, 2005, pp. 57-64.
