An image retrieval and semi-automatic annotation scheme for large image databases on the Web

Xingquan Zhu a,*, Liu Wenyin b, Hongjiang Zhang b, Lide Wu a

a Dept. of Computer Science, Fudan University, Shanghai, PR China; {980015, ldwu}@fudan.edu.cn
b Microsoft Research China, Beijing, PR China; {wyliu, hjzhang}@microsoft.com

ABSTRACT

Image annotation is used in traditional image database systems. However, without the help of human beings, it is very difficult to extract the semantic content of an image automatically. On the other hand, it is tedious to manually annotate the images in a large database one by one. In this paper, we present a Web-based semi-automatic annotation and image retrieval scheme that integrates image search and image annotation seamlessly and effectively. In this scheme, we use both low-level features and high-level semantics to measure the similarity between images in an image database, and a relevance feedback process at both levels is used to refine the similarity assessment. The annotation process is activated when the user provides feedback on the retrieved images. With the help of the proposed similarity metrics and the relevance feedback approach at these two levels, the system can find the images relevant to the user's keyword or image query more efficiently. Experimental results show that our scheme is effective and efficient and can be used for image annotation and retrieval in large image databases.

Keywords: Image annotation, image retrieval, relevance feedback, content-based image retrieval

1. INTRODUCTION

With the development of Web technology, more and more web sites provide image search engines. Because it is very hard to extract the semantics of images fully automatically and precisely [1], many search engines rely on annotation to manually attach semantics to images. After all images in the database have been annotated, the search engine can use keyword matching to retrieve the desired images. That is, users can post a query keyword to search for the images that are annotated with it. However, fully manual annotation is labor-intensive and may be subjective and inconsistent due to lapses in the annotator's attention. Although image retrieval techniques based on textual features can easily be automated, they suffer from the same problems as text-based information retrieval systems in document databases and Web search engines. Because of the widespread synonymy and polysemy in natural language, the precision of such systems is very low and their recall is inadequate. For example, both "girl" and "woman" can be used to describe relevant images, but simple keyword matching cannot retrieve the images that are annotated only with the other, related keyword. Image search engines could also use content-based image retrieval (CBIR) technology, which uses only low-level feature similarity to evaluate the distance between a query image and the other images in the database [2]. However, the retrieval accuracy of such systems is severely limited, because the available low-level visual features are insufficient to represent image content and because images are too complex for users to describe exactly. Hence, the relevance feedback approach has been proposed to reduce the need for accurate initial queries by estimating the user's ideal query from the positive and negative examples given by the user [3-9]. Admittedly, relevance feedback improves the retrieval result. However, the improvement is still unsatisfactory due to the inherent insufficiency of low-level features.

* This work was performed at Microsoft Research China.

To overcome the limitations of current image search engines, we present an image search and semi-automatic annotation strategy for large image databases. In this strategy, image search, relevance feedback, and image annotation are implemented as an integrated whole. Because neither high-level nor low-level features are adequate on their own, we use features at both levels simultaneously in image similarity assessment. Furthermore, the associations between the query keywords and the feedback images are strengthened or weakened as users mark images as relevant or irrelevant to the query; hence we call it a semi-automatic annotation strategy. The paper is organized as follows. In Section 2, we present the proposed image search and annotation strategy in detail. In Section 3, we describe the iFind image retrieval system that we have implemented based on the proposed method and provide experimental evaluations showing its effectiveness in image retrieval. Concluding remarks are given in Section 4.

2. THE PROPOSED METHOD

There are two different user interaction modes in typical image retrieval systems: query by keyword and query by example. In query by keyword, the user types in a list of keywords representing the semantic content of the desired images. In query by example, the user provides a set of example images as the query and the retrieval system tries to retrieve other similar images. In most image retrieval systems, these two modes of interaction are mutually exclusive. We argue that combining these two approaches and allowing them to benefit from each other yields considerable advantages in both retrieval accuracy and ease of use. We first present the image retrieval strategy, which utilizes both high-level and low-level features. The image annotation strategy is then built on top of the relevance feedback iterations.

2.1 Image Search Strategy

In this sub-section, we describe a method to construct a semantic network from an image database and present a simple machine learning algorithm to improve the system's retrieval and annotation accuracy over time. In addition, we describe a framework in which the constructed semantic network can be seamlessly integrated with low-level features. Based on this framework, relevance feedback can be performed at both levels.

2.1.1 Semantic Network

The semantic network is represented by a set of keywords having links to the images in the database. A weight wij is assigned to each individual link to record how well the keyword describes the content of the corresponding image. This representation is shown pictorially in Figure 1.

[Figure 1. Semantic network: keywords 1, ..., N are linked to images 1, ..., M in the database, and each link carries a weight wij recording the strength of the association.]
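In implementation terms, one plausible representation of such a weighted keyword-image network is a keyword-indexed dictionary of image weights. The sketch below is an illustration under our own assumptions (Python, hypothetical names); it is not a description of the authors' actual implementation.

```python
from collections import defaultdict

# links[keyword][image_id] = w_ij, the degree of relevance of the keyword
# to the semantic content of that image (cf. Figure 1).
links = defaultdict(dict)

links["horse"]["image_1"] = 3.0    # a keyword reinforced by repeated feedback
links["racing"]["image_1"] = 1.0   # a keyword linked once, not yet reinforced

def images_for(keyword):
    """Return the images linked to a keyword, strongest association first."""
    return sorted(links[keyword].items(), key=lambda kv: kv[1], reverse=True)
```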

The degree of relevance of a keyword to the associated image's semantic content is represented by the weight on the corresponding link. Clearly, an image can be associated with multiple keywords, each with a different degree of relevance. Keyword associations may not be available at the beginning, and there are several ways to obtain them. The most straightforward method is to label images manually; this is expensive and time-consuming, though it has been proven workable in many systems. Automated methods include learning keyword annotations from web pages [1]. Another way to incorporate additional keywords into the system is to exploit the user's input: whenever the user marks a set of images as relevant to the current query, we add the query keywords to the system and link them to these images. In Section 2.2, we present our image annotation strategy based on users' input queries and relevance feedback.

2.1.2 Hierarchical Image Annotation

To limit the influence of the widespread synonymy and polysemy in natural language, we use a hierarchical annotation method to describe the content of each image. At the first layer, a fixed keyword table defines the categories into which the images in the database are classified. As Table 1 shows, we use 24 keywords as the basic content categories of our system, and each image must be classified into one of these 24 categories. In some cases it may be hard to decide which single category an image belongs to; the user can then assign more than one keyword as the first-layer content of this image. Once the first-layer content of an image is fixed, users can use any keyword to describe the semantic content of the image in more detail at the second hierarchical level. Table 2 presents some examples of second-level keywords under "Animal". Figure 2 shows the annotation hierarchy of an image with two first-layer keywords, "Animal" and "Sports", and additional second-layer keywords "Mammal", "Horse", and "Racing".

Table 1. First Layer Keywords

Agriculture    Business     Fashion         Family     People     Science
Animal         China        Holiday         Military   Religion   Travel
Architecture   Cloth        Entertainment   Nature     Society    Vehicle
Art            Education    Food            Plant      Sports     Texture

Table 2. Examples of Second Layer Keywords under "Animal"

Bird         Pet         Catamount
Mammal       Wildlife    Primate
Sealife      Insect      Amphibian
Endangered   Reptile

[Figure 2. Hierarchical image annotation: image 1 carries the first layer content "Animal" and "Sports" and the second layer content "Mammal", "Horse", and "Racing".]
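In code terms, a hierarchical annotation is simply a two-level record per image. The snippet below is a minimal illustration of how the annotation of the image in Figure 2 could be represented; the field names are hypothetical, not part of the authors' system.

```python
# Hypothetical per-image annotation record for the example in Figure 2:
# first-layer keywords come from the fixed Table 1 vocabulary, while
# second-layer keywords are free-form refinements.
annotation = {
    "first_layer": ["Animal", "Sports"],
    "second_layer": ["Mammal", "Horse", "Racing"],
}
```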

2.1.3 Semantic Based Relevance Feedback

Semantic based relevance feedback is relatively easy to perform compared to low-level feature based feedback. The basic idea is a simple voting scheme for updating the weight wij associated with each link shown in Figure 1. The weight updating process is as follows.

1. Initialize all weights wij to 1, so that every keyword has the same importance.
2. Collect the user query and the positive and negative feedback examples.
3. For each keyword in the input query, check whether it is already in the keyword database. If not, add it to the database without creating any links.
4. For each positive example, check whether any query keyword is not yet linked to it. If so, create a link with weight 1 from each missing keyword to this image. For the query keywords already linked to this image, increase the weight by 1.
5. For each negative example, check whether any query keyword is linked to it. If so, set the new weight wij' = wij / 4. If the weight wij on any link drops below 1, delete that link.

It can easily be seen that as more queries are entered into the system, the system is able to expand its vocabulary. Moreover, through this voting process, the keywords that represent the actual semantic content of each image receive larger weights.

2.1.4 High-Level and Low-Level Relevance Feedback Integration

The relevance feedback approach to image retrieval is a powerful technique and has been an active research direction for the past few years. Many papers [3-7] present relevance feedback methods and their implementations in image retrieval systems. Unfortunately, most of these systems use relevance feedback in low-level feature similarity measurement only. In our framework, the images in the database are partly annotated by users or system operators. This semantic information enables another relevance feedback approach, which integrates high-level and low-level relevance feedback. The basic idea behind relevance feedback is to let the computer learn the user's real intention from feedback examples. There are two kinds of relevance feedback mechanisms for low-level features: one uses the feedback examples to modify the query [3, 4, 6, 7], and the other uses them to change the data distribution in the database [5, 9]. We can assume that all keywords associated with positive feedback examples are somewhat relevant to the user's real intent and that the keywords of negative examples are less relevant. For example, if the user selects an image whose keywords are "people", "girl", and "pretty", we can guess that the user wants to retrieve images with the semantic content "people", "girl", and "pretty". With the help of keyword information, we can measure the semantic similarity of images, which can be used together with low-level similarity to assess the overall similarity or relevance of the images. To combine the low-level feature-based feedback with the high-level semantic feedback, we first define some terms (Eq. 1 and Eq. 2) related to semantic similarity calculation and then define a unified similarity metric function Gj (Eq. 3) to measure the overall relevance of any image j in the database to the user's real intention, in terms of both semantic and low-level feature content.

Assuming that there are Nj keywords associated with image j, let Kj be the set of keywords of image j and Kj^k be the k-th keyword of image j, k = 1, ..., Nj. Rk (defined in Eq. 1) is the relative weighting of keyword k over all keyword weights of image j:

$$ R_k = \frac{w_k}{\sum_{l=1}^{N_j} w_l} , \qquad (1) $$

As we can see from Eq. 1, the larger Rk is, the better the corresponding keyword represents the semantics of the image.

Next, we create two lists, L+ and L-, in each retrieval and feedback process. L+ records all keywords associated with the positive feedback images and L- records all keywords associated with the negative feedback images. In each feedback iteration, we check for each keyword Kj^k of a positive feedback image, k = 1, ..., Nj, whether Kj^k is already in L+. If it is, we increase the weight of Kj^k in L+ by 1; if not, we add Kj^k to L+ with weight 1. The negative feedback list L- is updated in the same way. We then use the following equation to calculate the semantic relevance of keyword Kj^k of image j to the feedback images:

$$ \zeta_k^+ = \begin{cases} 0 , & K_j^k \notin L^+ \\[4pt] \dfrac{\text{weight of } K_j^k \text{ in } L^+}{\text{weight of } K_j^k \text{ in both } L^+ \text{ and } L^-} , & K_j^k \in L^+ \end{cases} \qquad (2) $$

We define the negative semantic relevance ζk- similarly. Hence, the values of ζk+ and ζk- determine the user's interest in keyword Kj^k of image j. The unified similarity metric function Gj is then defined using a modified form of Rocchio's formula [10], as follows:

$$ G_j = SIG(j) + \beta \left[ \frac{1}{N_R} \sum_{i \in N_R} \Bigl( 1 + \alpha \sum_{k=1}^{N_j} \zeta_k^+ \cdot R_k \Bigr) S_{ij} \right] - \gamma \left[ \frac{1}{N_N} \sum_{i \in N_N} \Bigl( 1 + \alpha \sum_{k=1}^{N_j} \zeta_k^- \cdot R_k \Bigr) S_{ij} \right] , \qquad (3) $$

where SIG(j) is the keyword match weight given in Eq. 4, Nj is the number of keywords of image j, Mj is the number of keywords matched between the query Q and image j, and Rjk is the relative weighting of matched keyword k over all keyword weights of image j. Hence, if image j shares a keyword with the query, SIG(j) is determined by three items: the number of keywords of image j, the number of matched keywords, and the relative weights of the matched keywords. Otherwise, SIG(j) is assigned the value 0.

$$ SIG(j) = \begin{cases} \dfrac{M_j}{N_j} \displaystyle\sum_{k=1}^{M_j} R_{jk} , & Q \in K_j \\[4pt] 0 , & Q \notin K_j \end{cases} \qquad (4) $$

In Eq. 3, NR and NN are the numbers of positive and negative feedback images, respectively. α ∈ [0, 1] is the weight of the semantic relevance in the overall similarity measurement, which can be specified by users. The larger α is, the more important a role the semantic relevance plays in the overall similarity measurement; if α = 0, only low-level relevance feedback is used in the image retrieval process. Sij is the low-level feature similarity between images i and j. The other two parameters, β and γ, are both set to 1.0 in our current implementation for the sake of simplicity. In fact, Eq. 3 is a modified form of Rocchio's formula, in which we add the keyword matching weight SIG(j) and the semantic relevances ζk+ and ζk- to the similarity evaluation. Compared with low-level features, semantic information is more reliable. Hence, for images whose keywords match the query we directly add an extra similarity SIG(j) to Gj according to the number and relative weights of the matched keywords. Other images are only assumed to be partly relevant to the current query in semantics.
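To make Eqs. 1-4 concrete, the sketch below is a minimal, illustrative Python implementation under our own assumptions: the semantic network is viewed per image as a dictionary of keyword weights, the feedback lists L+ and L- are dictionaries of keyword counts, and S[i][j] holds a precomputed low-level similarity between images i and j. All names are hypothetical; this is not the authors' code.

```python
from typing import Dict, List, Set


def relative_weights(keyword_weights: Dict[str, float]) -> Dict[str, float]:
    """Eq. 1: R_k = w_k / sum_l w_l over all keywords of one image."""
    total = sum(keyword_weights.values())
    return {k: w / total for k, w in keyword_weights.items()} if total else {}


def semantic_relevance(keyword: str, in_list: Dict[str, int], other_list: Dict[str, int]) -> float:
    """Eq. 2: zeta = 0 if the keyword is absent from the list, otherwise the share
    of its combined feedback weight that falls in this list."""
    if keyword not in in_list:
        return 0.0
    both = in_list[keyword] + other_list.get(keyword, 0)
    return in_list[keyword] / both


def sig(query_keywords: Set[str], keyword_weights: Dict[str, float]) -> float:
    """Eq. 4: keyword match weight SIG(j) between the query and image j."""
    matched = query_keywords & set(keyword_weights)
    if not matched:
        return 0.0
    R = relative_weights(keyword_weights)
    return (len(matched) / len(keyword_weights)) * sum(R[k] for k in matched)


def unified_similarity(j: str,
                       keyword_weights: Dict[str, float],   # keywords and weights of image j
                       query_keywords: Set[str],
                       pos_images: List[str], neg_images: List[str],
                       pos_list: Dict[str, int], neg_list: Dict[str, int],  # L+ and L-
                       S: Dict[str, Dict[str, float]],      # S[i][j]: low-level similarity
                       alpha: float = 0.5, beta: float = 1.0, gamma: float = 1.0) -> float:
    """Eq. 3: unified similarity G_j combining the keyword match weight with
    semantically boosted low-level similarities to the feedback images."""
    R = relative_weights(keyword_weights)
    pos_boost = 1 + alpha * sum(semantic_relevance(k, pos_list, neg_list) * R[k] for k in R)
    neg_boost = 1 + alpha * sum(semantic_relevance(k, neg_list, pos_list) * R[k] for k in R)
    pos_term = sum(pos_boost * S[i][j] for i in pos_images) / len(pos_images) if pos_images else 0.0
    neg_term = sum(neg_boost * S[i][j] for i in neg_images) / len(neg_images) if neg_images else 0.0
    return sig(query_keywords, keyword_weights) + beta * pos_term - gamma * neg_term
```

Note that with alpha = 0 the semantic boost vanishes and the score reduces to a plain Rocchio-style combination of low-level similarities, which matches the role of α described above.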

2.2 Image Annotation Strategy

In this section, we present our image annotation strategy based on the retrieval technology introduced above. With the help of relevance feedback, the feedback images are re-evaluated and their similarity to the current query is modified according to the user's perception. At the same time, the user's feedback drives our image annotation strategy. There are two annotation methods in our strategy: annotation by keyword-based image retrieval and annotation by query by example (QBE).

2.2.1 Annotation by Keyword-Based Image Retrieval

In the keyword-based image retrieval process, the semantic network of all feedback images is modified according to the user's query keywords and relevance feedback. Such semantic network modification creates or updates the images' annotations. The annotation steps are as follows.

1. The user inputs some query keywords, and the system looks up each keyword in the system's keyword list.
2. If no keyword matches the current query, the system returns random images as the result. After the user selects one or more images as relevance feedback, Eq. 3 is used to compute the overall similarity Gj between the relevance feedback examples and the images in the database. The top-ranked images are shown to the user as the feedback result.
3. If a keyword identical to the query exists, the images carrying that keyword are retrieved first. These images are then used as positive examples to retrieve other images in the database using Eq. 3. The retrieval results are presented to the user as the result of the first feedback iteration.
4. The user can run more feedback iterations according to their perception or stop the current retrieval at any time.
5. For each positive feedback image, if the query keyword is not linked to it, the system assigns the query keyword to it with weight 1. If an identical keyword is already linked, the system increases the weight of that keyword by 1.
6. For each negative example, if any query keyword is linked to it, the new weight is set to wij' = wij / 4.

In each feedback iteration, Eq. 3 is used to retrieve images that are similar to the current relevance feedback. Any image marked as a positive example is annotated immediately and its semantic network is modified at the same time (a code sketch of this link update is given at the end of this section). Image annotation is thus performed silently, without requiring explicit action from the user.

2.2.2 Annotation by QBE

At the beginning of image annotation, users do not always know in advance which kind of images they want to annotate. In this situation, they may want to browse the images in the database randomly and then decide to annotate the images they are interested in. Annotation by QBE helps users annotate images in such circumstances. It differs from annotation by keyword-based image retrieval, in which users must input query keywords first. In this strategy, users browse the image database randomly and then invoke the image retrieval process based on the examples they have selected. After an acceptable retrieval result is found, one or more keywords can be assigned to the retrieved images. The annotation steps of this method are as follows.

1. The user selects some images in the browser by marking them as positive feedback examples, going to the next page to select more images if necessary.
2. The system uses Eq. 3 to retrieve other similar images in the database. If the user is not satisfied with the current relevance feedback results, other images can be selected as feedback examples for further relevance feedback.
3. After the user finds enough relevant images, they can be selected as a group and labeled with the annotation keyword(s).
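The link-weight update performed in steps 5 and 6 of Section 2.2.1 (and in the voting scheme of Section 2.1.3) can be sketched as follows. This is an illustrative fragment under the assumption that the semantic network is stored as a keyword-to-image dictionary of weights; it is not the authors' actual code.

```python
def update_semantic_network(links: dict, query_keywords: set,
                            positive: list, negative: list) -> None:
    """Reinforce keyword-image links for positive feedback and weaken them
    for negative feedback, following Sections 2.1.3 and 2.2.1."""
    for kw in query_keywords:
        targets = links.setdefault(kw, {})
        for img in positive:
            # create a missing link with weight 1, otherwise strengthen it by 1
            targets[img] = targets.get(img, 0) + 1
        for img in negative:
            if img in targets:
                targets[img] /= 4.0        # w_ij' = w_ij / 4
                if targets[img] < 1:
                    del targets[img]       # drop links that fall below weight 1
```

Because positive feedback only ever adds 1 while negative feedback divides by 4, a keyword must be confirmed repeatedly before its link can survive several contradicting votes.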

2.3 New Image Registration

Adding a batch of new images to the database is a very common operation in image database systems. Newly added images often have no semantic information at all. In this sub-section, we present a new image registration strategy that improves the efficiency of new image annotation significantly. Initially there is no annotation information in the database, and the QBE annotation strategy could in principle be used to annotate most of the images. However, annotating all images in this way is tedious. Hence, the most reasonable approach is to associate each first-layer keyword with a small number of images. In our system, we use the QBE strategy to annotate about 100 images for each category. That is, for each first-layer keyword, we annotate at least 100 images with this method to provide enough information for estimating the semantic content of new images when they are registered.

After the QBE-based annotation strategy has produced this initial annotation of the database, we use the following steps to annotate newly added images; a code sketch of the keyword-guessing and confirmation steps follows below.

1. First, we put all new images into a special "unknown" category. For each new image, we compute its low-level similarity to all annotated images in the database.
2. For the top N (e.g., 100, as in our system) most similar annotated images, we count their keywords and sort the resulting list of candidate keywords by their frequency among these N images. We then assign the first M (we set M = 4 in our system) keywords in this list to the new image, with weight 1 for each keyword.
3. The M keywords guessed from other annotated images are not always correct for a given new image (and may be totally wrong). Hence, the user must confirm the actual semantic content of these new images in subsequent feedback iterations. The guessed keywords only give the new image some tentative semantic information; with these keywords and semantic relevance feedback, the new image can easily be found by our similarity measurement.
4. In each feedback iteration, the system checks whether any new image has been selected as a feedback example. If so, we check the weights of all keywords associated with this image. If any keyword's weight is greater than a threshold τ (we set τ = 2 in our system), we set this keyword as the actual annotation of the image and remove all other keywords associated with the image. After that, we move the image from the "unknown" category to the category determined by its actual keyword.

In this way, more and more new images are annotated gradually as the system is used. Section 3.2.1 presents the experimental results obtained with this new image registration strategy.
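The sketch below illustrates the keyword-guessing and confirmation steps, under the assumption that annotations are available as per-image keyword sets and that a low-level similarity function is given; the names and signatures are illustrative only.

```python
from collections import Counter


def guess_keywords(new_image, annotated, similarity, n=100, m=4):
    """Steps 1-2: propagate keywords from the N most similar annotated images
    and keep the M most frequent ones, each with initial weight 1."""
    neighbours = sorted(annotated, key=lambda img: similarity(new_image, img), reverse=True)[:n]
    votes = Counter(kw for img in neighbours for kw in annotated[img])
    return {kw: 1.0 for kw, _ in votes.most_common(m)}


def confirm_annotation(guessed_weights, tau=2.0):
    """Step 4: once feedback has pushed a keyword's weight above tau, keep it as
    the actual annotation and discard the remaining guesses."""
    confirmed = {kw: w for kw, w in guessed_weights.items() if w > tau}
    return confirmed if confirmed else guessed_weights
```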

3. IMPLEMENTATION AND EXPERIMENTS

3.1 The System

The iFind system is a Web-based image retrieval system in which multiple users can perform retrieval and annotation tasks simultaneously at any given time. The system supports four modes of interaction: image annotation based on semantic and low-level information, keyword-based image retrieval, image search by example images, and browsing the entire image database using pre-defined category information. When the user enters a keyword-based query, the system invokes the combined relevance feedback mechanism discussed in Section 2. The result page is shown in Figure 4. The user can select multiple images from this page and click the "Feedback" button to give positive and negative feedback to the system. A blue tick under an image means the image is marked as a positive example, and a red cross means a negative example. The system presents 216 images in 12 pages for each query. The first 166 images are retrieved using the algorithm outlined in Section 2; the next 50 images are randomly selected from each category. The purpose of presenting the randomly selected images is to give the user a new starting point if none of the images actually retrieved by the system is considered relevant. New search results are presented to the user as soon as the "Feedback" button is pressed. At any time during the retrieval process, the user can click the "View" link to view a particular image in its original size, or click the "Similar" link to perform an example-based query. The user can also mark several images as positive examples and then click the "Annotation" link to directly annotate those images with certain keywords. For safety, the system does not save the semantic network without the user's confirmation. If users are satisfied with the current retrieval and annotation results, they can click the "Done" link to store the current semantic network; if not, they can click the "Undo" link to reload the semantic network as of the last save operation.

3.2 Experimental Results

To verify the efficiency of our new image registration strategy and the overall system performance, two kinds of experimental results are presented below. In our experiments, we use an image database containing 12,000 images selected from the Corel Image Gallery, manually classified into 60 categories.

3.2.1 New Image Registration Result

Figure 3 presents the results of using the method described in Section 2.3 to estimate the keywords of each new image. In this experiment, there are 5,600 initially annotated images in our image database. We add 500 new images to the system at the same time and obtain four keywords (M = 4) for each of these 500 images. As shown in Figure 3, almost 75% of the newly added images are represented by at least one relevant guessed keyword, and 10% of the new images by at least three relevant guessed keywords. Hence, a batch of new images can be annotated efficiently using this registration method. Since we use a hierarchical image annotation strategy in our system, the result of new image content estimation is better than that of other methods [1]. Even when the registration method fails to capture the real content of an image, relevance feedback can still be used to remove the wrong keywords from the image. With this new image registration strategy, the efficiency of our system is improved substantially. Note, however, that this strategy cannot learn new keywords from the examples in the database; hence, if a batch of new images has no corresponding category, the method will fail to achieve a good result.

[Figure 3. New image content guess result: percentage of new images versus the number of relevant guessed keywords (1 to 4).]

3.2.2 The System Performance Result

To verify the effectiveness of our system, we have compared it against another state-of-the-art image retrieval system; specifically, we compare our method with the retrieval technique used in the CBIR system of [7]. The comparison is made over 8 sets of random queries with 10 feedback iterations for each query, and the number of correctly retrieved images is counted after each feedback iteration. The result is then plotted against the number of feedback iterations. We ensured that none of the query keywords was labeled on any of the images and that there are exactly 100 images with the correct semantic content in our image database. Since we use exactly 100 images as the ground truth for each query and retrieve exactly 100 images, precision and recall take the same value; we therefore use the term "Accuracy" to refer to both in our plot. The result is shown in Figure 5. As the results show, our system achieves on average 80% retrieval accuracy after just 4 user feedback iterations and over 95% after 8 iterations for any given query. In addition, more relevant images are retrieved as the number of feedback iterations increases. Unlike some earlier methods, where more user feedback may even lead to lower retrieval accuracy, our method proves to be more stable. It can easily be seen from these results that by combining semantic-level feedback with low-level feature feedback, the retrieval and annotation accuracy is improved substantially.

Figure 4. User Interface

[Figure 5. System performance comparison: retrieval accuracy (%) of iFind versus the CBIR baseline [7] over 1 to 10 relevance feedback iterations.]

4. CONCLUSION

In this paper, we have presented a new framework in which semantic and low-level feature based relevance feedback are combined to help each other, achieving higher retrieval accuracy with fewer feedback iterations required from the user. On top of this retrieval framework, our image annotation strategy is constructed from the user's feedback. The features that distinguish our framework from existing image retrieval or annotation approaches are threefold. First, it introduces a method to construct a semantic network on top of an image database and uses a simple machine learning technique to learn from user queries and feedback to further improve this semantic network. Second, it introduces a scheme in which semantic and low-level feature based relevance feedback are seamlessly integrated. Third, a useful and efficient annotation strategy is presented, based on this framework, to obtain accurate and rapid annotation of any image in a large database. With this strategy, the shortcomings of fully manual annotation can easily be overcome.

REFERENCES

1. Paek, S., Sable, C. L., Hatzivassiloglou, V., Jaimes, A., Schiffman, B. H., Chang, S. F., and McKeown, K. R., "Integration of Visual and Text-Based Approaches for the Content Labeling and Classification of Photographs," in Proc. of SIGIR'99, 1999.
2. Idris, F., and Panchanathan, S., "Review of Image and Video Indexing Techniques," Journal of Visual Communication and Image Representation, Vol. 8, No. 2, pp. 146-166, June 1997.
3. Buckley, C., and Salton, G., "Optimization of Relevance Feedback Weights," in Proc. of SIGIR'95, 1995.
4. Ishikawa, Y., Subramanya, R., and Faloutsos, C., "MindReader: Querying Databases Through Multiple Examples," in Proc. of the 24th VLDB Conference, New York, 1998.
5. Lee, C., Ma, W. Y., and Zhang, H. J., "Information Embedding Based on User's Relevance Feedback for Image Retrieval," Technical Report, HP Labs, 1998.
6. Rui, Y., Huang, T. S., and Mehrotra, S., "Content-Based Image Retrieval with Relevance Feedback in MARS," in Proc. IEEE Int. Conf. on Image Processing, 1997.
7. Rui, Y., and Huang, T. S., "A Novel Relevance Feedback Technique in Image Retrieval," in Proc. ACM Multimedia, 1999.
8. Salton, G., and McGill, M. J., Introduction to Modern Information Retrieval, McGraw-Hill Book Company, 1983.
9. Cox, I. J., Miller, M. L., Minka, T. P., Papathomas, T. V., and Yianilos, P. N., "The Bayesian Image Retrieval System, PicHunter: Theory, Implementation, and Psychophysical Experiments," IEEE Trans. on Image Processing, Vol. 9, No. 1, pp. 20-37, Jan. 2000.
10. Rocchio, J. J., "Relevance Feedback in Information Retrieval," in The SMART Retrieval System, pp. 313-323, Prentice Hall, 1971.