IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 4, NO. 2, JUNE 2002


An Image Retrieval System With Automatic Query Modification

Gaurav Aggarwal, Ashwin T. V., and Sugata Ghosal

Abstract—Most interactive "query-by-example" based image retrieval systems utilize relevance feedback from the user for bridging the gap between the user's implied concept and the low-level image representation in the database. However, traditional relevance feedback usage in the context of content-based image retrieval (CBIR) may not be very efficient due to a significant overhead in database search and image download time in client-server environments. In this paper, we propose a CBIR system that efficiently addresses the inherent subjectivity in user perception during a retrieval session by employing a novel idea of intra-query modification and learning. The proposed system generates an object-level view of the query image using a new color segmentation technique. Color, shape and spatial features of individual segments are used for image representation and retrieval. The proposed system automatically generates a set of modifications by manipulating the features of the query segment(s). An initial estimate of user perception is learned from the user feedback provided on the set of modified images. This largely improves the precision in the first database search itself and alleviates the overheads of database search and image download. Precision-to-recall ratio is improved in further iterations through a new relevance feedback technique that utilizes both positive as well as negative examples. Extensive experiments have been conducted to demonstrate the feasibility and advantages of the proposed system.

Index Terms—Color image segmentation, content-based image retrieval, intra-query learning, intra-query modification, query refinement, relevance feedback.

I. INTRODUCTION

Rapid growth in the number and size of image databases has created the need for more efficient search and retrieval techniques, since conventional database search based on textual queries can provide at best a partial solution to the problem. Database images are often not annotated with textual descriptions, or the vocabulary needed to describe the user's implied concept does not exist (or is at least not known to the user). Moreover, there is rarely a unique description that can be associated with a particular image. Thus, there has recently been intense activity in building direct content-based image search engines. In content-based search engines, each image is represented using features such as color, texture, shape, or position. A database consisting of the feature vectors of all images is created.

Manuscript received April 12, 2001; revised February 26, 2002. The associate editor coordinating the review of this paper and approving it for publication was Dr. Ahmed Tewfik. G. Aggarwal was with the IBM India Research Laboratory, New Delhi 110016, India. He is now with Broadcom, Bangalore 560025, India (e-mail: [email protected]). Ashwin T. V. and S. Ghosal are with the IBM India Research Laboratory, New Delhi 110016, India (e-mail: [email protected]; [email protected]). Publisher Item Identifier S 1520-9210(02)04869-1.

Images in the database that are nearest neighbors to the query image according to a similarity metric in the feature space are retrieved. There are two prevalent approaches to measuring similarity: a) geometric distance-based [29] and b) probabilistic likelihood-based [21]. The first step in designing a CBIR system is the selection of an appropriate feature space, so that images that are "close" in feature space are also perceptually close to the user. However, a fully automatic, rigid approach to image retrieval cannot satisfy the information needs of a diverse user population [29]. Therefore, relevance feedback during a retrieval session has emerged as a de facto standard methodology in recent CBIR systems for bridging the gap between the user's high-level concept (e.g., sunset) and the low-level representation of images in the feature space (e.g., dominant orange-yellow color distribution). Given the user's preferences on a set of retrieved images, the goal is to learn his notion of similarity by adjusting the parameters of the chosen similarity metric, and to improve the relevance of the retrieved images to that user in successive iterations. This process of repeated database search, however, becomes a bottleneck as the database size increases. Furthermore, when the database is located remotely, say over the World Wide Web, downloading irrelevant images in each search iteration significantly slows down retrieval of the image of interest. It is, therefore, desirable to be able to understand user perception from the query image itself at the client site. This increases the relevance of the images retrieved from the database, thereby reducing the time required to find the images of interest.
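As background, geometric distance-based retrieval reduces to a weighted nearest-neighbor search in feature space. A minimal sketch (the function name, toy feature vectors, and weights below are illustrative, not from the paper):

```python
import numpy as np

def retrieve_nearest(database, query, weights, k=3):
    """Rank database feature vectors by weighted Euclidean distance
    to the query vector and return the indices of the k nearest."""
    diff = database - query                        # (n, d) residuals
    dists = np.sqrt(((diff ** 2) * weights).sum(axis=1))
    order = np.argsort(dists)                      # nearest first
    return order[:k].tolist()

# Toy 2-D features (say, dominant color and relative size).
db = np.array([[0.1, 0.9], [0.8, 0.2], [0.15, 0.85], [0.5, 0.5]])
top2 = retrieve_nearest(db, np.array([0.1, 0.9]), np.array([1.0, 1.0]), k=2)
```

Relevance feedback then amounts to re-estimating `weights` (or, more generally, a full quadratic metric) from the user's accept/reject labels.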
In this paper, we propose a new CBIR system called iPURE (Perceptual and User-friendly REtrieval) that incorporates a novel methodology of intra-query modification and learning of user perception at the client site, in addition to relevance feedback in successive iterations. An object-level view of the query image is first obtained using image segmentation. Once the user selects segment(s) of interest, a set of modified images is automatically generated by the system at the client site. Initial user perception is learned from the user feedback on this set of modifications. A new color image segmentation algorithm has been developed that is reasonably accurate and at the same time fast enough for an interactive application. In addition, a novel relevance feedback technique that explicitly uses both positive and negative examples to improve retrieval performance is incorporated in this system.

1520-9210/02$17.00 © 2002 IEEE

A. Relationship With Existing Retrieval Systems

All CBIR systems essentially perform search and retrieval operations in image databases using image-centric features like color, texture, shape, position, etc., rather than the traditional way of searching annotated databases using keywords. Individual systems have typically focused on some key issues of this multidisciplinary problem. In order to put the proposed iPURE system in the right perspective, we present a high-level categorization of various image retrieval systems in Fig. 1 based on three criteria, viz., segmentation, relevance feedback, and query modification.¹ Research activity in CBIR has progressed in three main directions. Initial systems, built using carefully selected global image features and fixed similarity metrics, perform well for retrieving images containing "stuff" and "scenes" [12], [26], [34], where the entire image is relevant. However, these systems are not suited for searching for objects when large parts of the image are irrelevant. Thus, a second class of systems has been built around image segmentation. These systems are designed to retrieve "things" instead, by extracting local segment features [31], [14], [7], [20] and matching the object-level view of database images using some predetermined similarity metric in terms of segment features. Both region-based and contour-based shape features have been investigated for object-based image retrieval. While region-based features are more robust, contour-based features can result in more precise retrieval, especially in the presence of occlusion [31], [14]. CBIR systems with a predefined notion of similarity, irrespective of the features they employ, are efficient for searching homogeneous image databases, where the notion of perceptual similarity during database querying is implicitly obvious. An example query image always leads to the same set of retrieved images, even though different users may have different requirements.
The performance of image-centric retrieval systems with a fixed similarity metric is not satisfactory, essentially due to the gap between the user's implied concept and low-level visual features, and also due to the inherent subjectivity of human perception. MARS [29] is one of the first retrieval systems to employ relevance feedback; it estimates the user's perception of the query image by weighting low-level features. PicHunter [9], FourEyes [22], SurfImage [23], and ImageRover [32] are some of the other CBIR engines that employ relevance feedback. All these systems use global features for database indexing and retrieval. We propose a novel methodology here that uses automatic modifications of the query image itself and learns user perception through intra-query learning on this set of modified images. Since the images are generated at the client end itself, this saves the database search and image download time, which is a significant overhead in the current relevance-feedback-based image retrieval paradigm. In comparison, VisualSEEk [34] only allows a user to manually modify the global features of the image. Our technique modifies the retrieval parameters and synthesizes training images automatically. User responses to this set of modified images are used to learn the initial weights of the retrieval parameters. In comparison, systems like Blobworld [7] and QBIC [12] depend on the user's assignment of weights to different image properties, e.g., color, texture, shape, position, etc.

¹Note that the distances of the various systems from the origin do not bear any special significance; Fig. 1 is for bringing out the key features of some current systems and our relationship to these systems.


Fig. 1. High-level classification of CBIR systems.

A typical user is expected to be oblivious to these image-centric terms. In fact, it is not easy even for an image processing expert to properly specify the feature weights without a priori knowledge of the retrieval strategy and the feature distribution in the database. Furthermore, our system uses an object-level view of the image via image segmentation and hence allows object-level modifications and retrieval of "things."

The rest of this paper is organized as follows. An overview of the proposed iPURE system is presented in Section II. The iPURE segmentation technique and feature extraction are described in Section III. We present the iPURE intra-query modification methodology in Section IV. Section V deals with the iPURE intra-query learning and relevance feedback technique. The effectiveness and performance of the iPURE system are reported in Section VI. Finally, we summarize the present study and mention some future research issues in Section VII.

II. iPURE SYSTEM ARCHITECTURE

The iPURE system is a segmentation-based image retrieval system that works in a client-server environment. The system architecture is shown in Fig. 2. Database creation and updating is an offline task in which images are first segmented using a new, reasonably accurate yet fast color segmentation scheme. Feature-based segment descriptors as well as spatial segment descriptors are extracted for each segment in the image. These feature vectors are ingested along with the segmentation results into the database. The server segments the query image given by the user in real time (in an interactive sense) and sends the segmentation results to the client. The user can select one or more segments of interest and can then either proceed directly to searching the database, or go through intra-query modification and learning. Our retrieval strategy for multiple segments is similar to that of Blobworld [7] and VisualSEEk [34] and consists of two steps.
In the first step, images containing the most similar segments in terms of individual segment similarity are identified. Each segment is weighted equally to compute the rank of the image. In the second step, the spatial relationships among the segments are used to rank the images identified in the first step. For a single query segment, only the first step is performed.
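The two-step strategy can be sketched as follows; the dictionary-based data layout and the simple left-to-right layout check are illustrative stand-ins for the actual spatial-relationship ranking:

```python
import numpy as np

def rank_images(features, layouts, query_feats, query_layout, shortlist=2):
    """Step 1: score each image by the mean distance from every query
    segment to its best-matching image segment (segments weighted
    equally). Step 2: stably re-rank the shortlist, preferring images
    whose left-to-right segment order matches the query's."""
    step1 = []
    for img, segs in features.items():
        segs = np.asarray(segs, float)
        d = [np.linalg.norm(segs - np.asarray(q, float), axis=1).min()
             for q in query_feats]
        step1.append((sum(d) / len(d), img))
    top = [img for _, img in sorted(step1)[:shortlist]]
    # Step 2: layout mismatch pushes an image down the shortlist.
    return sorted(top, key=lambda img: layouts[img] != query_layout)

features = {"a": [[0, 0], [1, 1]], "b": [[5, 5], [6, 6]]}
layouts = {"a": ("red", "blue"), "b": ("blue", "red")}
ranked = rank_images(features, layouts, [[0, 0], [1, 1]], ("red", "blue"))
```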

AGGARWAL et al.: IMAGE RETRIEVAL SYSTEM WITH AUTOMATIC QUERY MODIFICATION

Fig. 2. Architecture of the iPURE image retrieval system.

One of the key innovations in the iPURE system is the intra-query modification and learning phase, in which feature re-weighting and query redefinition are performed before searching the database. The client Java applet generates a set of images by modifying the values of the retrieval features of the query segments. The user feedback on these modified images is sent to the server, which employs relevance feedback techniques to estimate the retrieval parameters of the similarity metric used for ranking segments in the database. Modifications also enable the user to redefine the query point, by accepting a modified image and giving negative feedback on the original query image itself. The learned retrieval parameters are used during the database search, and the top-$k$ nearest neighbors to the query point are sent to the client. The user may again give feedback on the retrieved images, which is used to further refine the retrieval parameters and search the database. Successive iterations of this relevance feedback loop help the user converge to her requirements.

III. IMAGE SEGMENTATION IN iPURE

The proposed iPURE system employs a novel color image segmentation scheme that is fast and reasonably accurate, in order to support query-time generation of the object-level view of a user-provided initial query image and modification of the same during an interactive retrieval session, by leveraging the strengths of region-based and edge-based approaches. This gives the user the capability to start the search from his own query image. Edge-based segmentation alone suffers from the presence of spurious and missing edge pixels, whereas stand-alone region-based approaches, in general, offer limited performance due to the difficulty of finding initial seed points in image space or modes in feature space. Thus, integrated segmentation techniques that incorporate both edge-based and region-based information have been proposed for both gray-level and color images. One of the earliest integrated segmentation techniques for gray-level images was proposed by Pavlidis et al. [25]. In this approach, segments are obtained using a region-growing approach, and then edges between regions are eliminated or modified based on contrast, gradient, and the shape of the boundary. Interested readers are referred to [24] and [33] for relatively recent surveys of image segmentation techniques. In the context of color image retrieval, Saber and Tekalp proposed a new integrated method for combined color image segmentation and edge linking [30]. In their method, an initial segmentation is obtained using color information alone. Spatial contiguity of the resulting segments is ensured by employing a Gibbs random field model based segmentation map. Next, spatial color edge locations are determined. Finally, regions in the segmentation map are split and merged by a region-labeling procedure to enforce their consistency with the edge map. Fan et al. have also recently reported an integrated color segmentation method for automatic face detection. Color edges are first obtained by combining an isotropic edge detector and an entropic thresholding technique. Significant geometric structures in the edge map are analyzed to generate initial seeds for seeded region growing. The results of edge processing and region growing are finally integrated to provide homogeneous image regions with accurate and closed boundaries. The basic idea behind the iPURE approach is to generate an initial over-segmentation of the image by detecting the dominant color features, and then to merge perceptually nonobvious contiguous segments using edge information. In order to generate perceptually acceptable segments and to aid query modification, the segment boundaries are further regularized by solving a quadratic optimization problem.


A. Feature-Driven Color Region Segmentation

The iPURE system first generates a region-based segmentation of the image by analyzing the global histogram of the intensities in the LUV space. The LUV color space is chosen for its approximate perceptual uniformity and its ability to decouple illumination and color information. A simple nonparametric procedure using the mean shift algorithm [8] is used to estimate density gradients and, finally, to robustly determine the dominant color modes of the histogram. Segmentation parameters are chosen to generate an over-segmentation of the input image.

B. Edge-Based Color Region Merging

Since a histogram essentially captures the global characteristics of an image, a relatively large segment with a gradual change in color is frequently split into multiple perceptually nonobvious segments. There is an inherent difficulty in finding the modes of the histogram, and the preset values of minimum and maximum segment size can also affect the segmentation. Sufficient under-segmentation of the image can merge these segments, but only at the expense of unacceptably merging segments with distinct colors. iPURE instead performs edge-based post-processing of the over-segmented result to properly merge the segments. First, a histogram of gradient magnitudes, obtained using Sobel operators, is computed at the segment boundary pixels for each of the $L$, $U$, and $V$ channels. A high threshold $T_h$ is chosen for each channel so that about 10% of the total boundary pixels have gradient magnitudes greater than $T_h$. Pixels with gradient magnitude greater than this high threshold indeed correspond to perceptually obvious strong edges in a wide variety of images. A low threshold $T_l$ is then chosen to be moderate, around 40% of the high threshold $T_h$. The lower gradient-magnitude threshold $T_l$ is needed only in poorly captured images with very wide local contrast variations (i.e., partially over- and under-exposed).
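The mode-finding step of the mean shift procedure above can be sketched in one dimension (the paper operates on the 3-D LUV histogram; the flat kernel, bandwidth, and sample values below are illustrative):

```python
import numpy as np

def mean_shift_modes(samples, bandwidth=1.0, iters=100, tol=1e-4):
    """Shift every point to the mean of the samples within `bandwidth`
    until convergence; the distinct converged positions approximate
    the modes (dominant peaks) of the underlying density."""
    data = np.asarray(samples, float)
    pts = data.copy()
    for _ in range(iters):
        new = np.array([data[np.abs(data - p) <= bandwidth].mean()
                        for p in pts])
        done = np.abs(new - pts).max() < tol
        pts = new
        if done:
            break
    modes = []
    for p in sorted(pts):                # collapse duplicates into modes
        if not modes or p - modes[-1] > bandwidth / 2:
            modes.append(p)
    return modes

# Two well-separated clusters of channel values -> two modes.
modes = mean_shift_modes([0.0, 0.1, 0.2, 4.9, 5.0, 5.1])
```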
Next, for each segment $S_i$, the fraction $f_{ij}$ of boundary pixels with gradient magnitude greater than $T_l$ is computed for each of its contiguous segments $S_j$. If $f_{ij}$ is less than, say, 50% for each of the $L$, $U$, and $V$ channels, it is inferred that the region-based and edge-based boundaries do not substantiate each other and, hence, the segments $S_i$ and $S_j$ are merged. This idea of threshold selection is very similar in spirit to hysteresis thresholding, originally proposed in [6]. The perceptual quality of the segmentation is further improved using results from vision psychology. It has been established that the human eye cannot distinguish subtle color differences when the luminance value is either too high or too low. Hence, if the average luminance values of contiguous segments $S_i$ and $S_j$ are both below a low empirical threshold, or both above a high empirical threshold, the fractional threshold for $f_{ij}$ is increased from, say, 50% to 65%.

Note that $T_h$ and $T_l$ cannot be selected very reliably for two contiguous segments that are quite close in terms of their color properties. Thus, in order to improve the robustness of the segmentation process, it is ensured while merging two segments that the difference of the mean color vectors of the candidate segments does not exceed, say, 20. In addition, if the difference of the mean color vectors of the candidate segments is less than 10, the segments are merged irrespective of the edge strengths between them. The color difference is the Euclidean distance between the mean LUV color vectors,

$\|\bar{\mathbf{c}}_i - \bar{\mathbf{c}}_j\| = \sqrt{(\bar{L}_i - \bar{L}_j)^2 + (\bar{U}_i - \bar{U}_j)^2 + (\bar{V}_i - \bar{V}_j)^2}$

where $\bar{\mathbf{c}}_i$ and $\bar{\mathbf{c}}_j$ are the mean colors of the two contiguous segments $S_i$ and $S_j$. Usually one iteration is sufficient for removing most of the perceptually nonobvious segments from an image. Our choice of thresholds has been tested with over 2000 images in the Corel dataset.

C. Regularization of Segment Shapes

The human eye generally tends to perceive smooth shapes as opposed to sudden local variations in shape; in essence, the human eye searches for "regularity" in an image segment more often than not. However, segments often have complicated contours. Inherent imaging noise, as well as the difficulty of histogram mode finding and of threshold setting during edge-based post-processing, further leads to perceptually less desirable segmentations. In the iPURE system, segment boundaries are regularized by formulating a quadratic optimization problem that minimizes the total length of the segment boundaries while constraining the segments themselves to be minimally modified.

Let $x_i$ denote an image pixel and $S_k$ a segment, and let $v_{ik} = 1$ if pixel $x_i$ belongs to the $k$-th segment and $v_{ik} = 0$ otherwise. This membership requirement corresponds to the constraint function

$C = \sum_i \Big( \sum_k v_{ik} - 1 \Big)^2 + \sum_i \sum_k v_{ik} (1 - v_{ik}).$  (1)

The first term guarantees that a pixel belongs to only one segment, while the second term ensures that $v_{ik}$ is either 0 or 1. Now, let $n_k$ be the number of pixels in the original $k$-th segment. Thus we need to minimize the cost function $E$:
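The hysteresis-style threshold computation and merge test of Section III-B can be sketched as follows (the 90th-percentile estimate of $T_h$ and the 40%/50% constants follow the values quoted above; the array shapes and toy gradient values are illustrative):

```python
import numpy as np

def compute_thresholds(all_boundary_grads):
    """Per-channel thresholds from gradient magnitudes at *all*
    segment boundary pixels (rows: L, U, V channels). T_h is set so
    about 10% of boundary pixels exceed it; T_l is 40% of T_h."""
    t_high = np.percentile(all_boundary_grads, 90, axis=1)
    return t_high, 0.4 * t_high

def should_merge(pair_boundary_grads, t_low, frac=0.5):
    """Merge two contiguous segments if, in every channel, fewer than
    `frac` of their shared boundary pixels exceed T_l, i.e., the edge
    evidence does not substantiate the region boundary."""
    strong = (pair_boundary_grads > t_low[:, None]).mean(axis=1)
    return bool((strong < frac).all())

grads = np.tile(np.arange(1.0, 11.0), (3, 1))      # all boundary pixels
t_high, t_low = compute_thresholds(grads)
weak_edge = np.tile(np.array([1.0, 2.0, 3.0, 4.0]), (3, 1))
strong_edge = np.tile(np.array([5.0, 6.0, 7.0, 8.0]), (3, 1))
```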

$E = C + \sum_k \Big( \sum_i v_{ik} - n_k \Big)^2 + \sum_i \sum_{j \in N(i)} \sum_k v_{ik} (1 - v_{jk})$  (2)

where $C$ is the constraint function (1) and $N(i)$ denotes the eight-connected neighbors of pixel $x_i$. The second term is minimized if the original segment sizes are not changed, whereas the third term is minimized if the neighbors of each pixel belong to the same segment. Thus, minimization of the third term alone would lead to a single segment in the image, which in turn would incur a high penalty in the second term. We solve the optimization problem described by (2) by mapping it onto a Hopfield-type neural network, as described in [27]. Note that the edge-based processing and the shape regularization could be combined in the cost function by properly incorporating the image gradients into the third term of (2).

Fig. 3. Post-processing of color-based segmentation results for an image containing horses. (a) Original image, (b) over-segmented, (c) region-merged, and (d) shape regularized.

A sparse implementation of the Hopfield network, proposed in [13], [27], is employed to reduce the complexity of a Hopfield state update to $O(N_b)$, where $N_b$ is the number of segment boundary pixels. The states of the network are initialized according to the initial edge-processed segmentation.

Fig. 3(a) shows the original image of multiple horses of different colors and sizes, taken from the "horses" category of the Corel stock pictures. Fig. 3(b) shows the twelve segments generated by the mean-shift algorithm. The background area has been over-segmented into eight segments due to shading and illumination effects, and because of a strong shadowing effect, part of the larger horse has been merged with the head of the smaller horse. Edge-based post-processing merges the nonobvious background segments into one background segment that actually corresponds to human perception, as shown in Fig. 3(c). Finally, the segment boundaries are regularized, i.e., smoothed, as shown in Fig. 3(d). This regularization of the segment boundaries aids feature extraction and shape matching. The effectiveness of the proposed integrated segmentation technique and post-processing is further demonstrated on other complex outdoor scenes, as shown in Figs. 4 and 5.

The entire segmentation process, including the mean-shift algorithm and edge-based post-processing, takes around 5 s for a 128 × 192 image on a standard workstation. Edge-based post-processing and shape regularization depend on the number of segment boundary pixels in the region-based segmentation. For a wide variety of stock photography images, edge-based post-processing takes 1–2 s, and 20 iterations of the Hopfield network take around another 1–2 s. Note that without the initial segmentation and the sparse implementation, the Hopfield network takes tens of minutes of execution time to generate an acceptable segmentation.
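For a hard (0/1) labelling, the cost in (2) reduces to a size term plus a count of disagreeing eight-connected neighbor pairs, which a small evaluator makes concrete (the weighting factors are illustrative; the paper minimizes this cost with a Hopfield network rather than by exhaustive evaluation):

```python
import numpy as np

def regularization_cost(labels, orig_sizes, lam_size=1.0, lam_smooth=1.0):
    """Cost of a hard segment labelling: squared deviation of segment
    sizes from the original sizes, plus the number of ordered
    eight-connected neighbor pairs carrying different labels. The
    membership constraint term of (1) vanishes for a hard labelling."""
    sizes = np.array([(labels == k).sum() for k in range(len(orig_sizes))])
    size_term = int(((sizes - np.asarray(orig_sizes)) ** 2).sum())
    h, w = labels.shape
    smooth = 0
    for y in range(h):
        for x in range(w):
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy, xx = y + dy, x + dx
                    if (dy or dx) and 0 <= yy < h and 0 <= xx < w \
                            and labels[yy, xx] != labels[y, x]:
                        smooth += 1
    return lam_size * size_term + lam_smooth * smooth

two_segs = np.array([[0, 0], [1, 1]])    # two 2-pixel segments, one boundary
cost_keep = regularization_cost(two_segs, [2, 2])       # sizes preserved
cost_flat = regularization_cost(np.zeros((2, 2), int), [4, 0])
```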


Fig. 4. Segmentation results for a beach scene. (a) Original image, (b) over-segmented, (c) region-merged, and (d) shape regularized.

Fig. 5. Color-based segmentation of a mountain image. (a) Original image, (b) over-segmented, (c) region-merged, and (d) shape regularized.

D. Feature Extraction and Database Creation

Segmentation provides an object-level view of an image, i.e., an image can be represented by the union of the segments generated by the iPURE segmentation module. Thus, an image in the database is represented by a set of segment descriptors associated with its constituent segments. Matching and retrieval are performed based on the segment descriptors. The feature-based segment descriptor in the iPURE system currently comprises the average LUV color of the segment, position ($x$ and $y$ centroid coordinates w.r.t. the image), size (number of pixels in the segment), orientation axis, and three shape moment invariants. The shape moments in iPURE are calculated using the normalized central moments $\eta_{pq}$, defined as

$\eta_{pq} = \frac{\mu_{pq}}{\mu_{00}^{\gamma}}, \qquad \gamma = \frac{p+q}{2} + 1 \quad \text{for } p + q = 2, 3, \ldots$

using the central moments $\mu_{pq}$, defined as

$\mu_{pq} = \sum_x \sum_y (x - \bar{x})^p (y - \bar{y})^q$

where $(\bar{x}, \bar{y})$ is the segment centroid and the sums run over the pixels of the segment. The three shape moments $\phi_1$, $\phi_2$, and $\phi_3$ are computed from these normalized central moments as follows [18]:

$\phi_1 = \eta_{20} + \eta_{02}$
$\phi_2 = (\eta_{20} - \eta_{02})^2 + 4\eta_{11}^2$
$\phi_3 = (\eta_{30} - 3\eta_{12})^2 + (3\eta_{21} - \eta_{03})^2.$  (3)

This set of moments is invariant to translation, rotation, and scale change. The orientation angle, defined as the angle of the axis of least moment of inertia, is computed as follows [18]:

$\theta = \frac{1}{2} \tan^{-1} \left( \frac{2\mu_{11}}{\mu_{20} - \mu_{02}} \right).$  (4)

The spatial segment descriptor comprises the spatial extent (coordinates of the minimum bounding rectangle) and contiguity information (which segments touch the segment at the boundary). Minimum bounding rectangles are used to determine top-down and left-right relationships. Normalization is performed for each feature dimension independently, based on the characteristics of its histogram, to make the ranges approximately equal. We are currently investigating more sophisticated feature extraction and representation techniques for improving the retrieval performance [10].

The iPURE system uses the Mahalanobis distance as the similarity metric for matching segment descriptors, defined as

$d(\mathbf{s}, \mathbf{q}) = (\mathbf{s} - \mathbf{q})^{T} \mathbf{M} (\mathbf{s} - \mathbf{q})$  (5)

where $\mathbf{s}$ and $\mathbf{q}$ denote the feature vectors of a database segment and the query segment, respectively, and $\mathbf{M}$ is the inverse of the correlation matrix.

IV. iPURE QUERY MODIFICATION

Traditionally, relevance-feedback-based retrieval systems have used images retrieved from the database to obtain user feedback. In contrast, the iPURE system generates a set of training examples from the query image itself at the client site, by modifying the retrieval features of the query segment(s). The user feedback on this set of synthetic images is used to generate an initial estimate of the matrix $\mathbf{M}$ and the query point $\mathbf{q}$ in (5), which are then used to search the database. This substantially improves the retrieval results, since the estimated ellipsoid approximates the user's requirements. The user may then give feedback on the retrieved results, which is used to further refine the ellipsoid to the user's requirements through relevance feedback techniques. The techniques used to generate the modified images are described in this section.

A. Single Segment Modification

When the user selects a single segment in the query image to start the search process, modifications are made to the color, position, size, and orientation features of the selected segment, and the corresponding images are synthesized automatically. The retrieval parameters are then estimated from the corresponding user responses to the modified segments. To effectively prune the number of modifications, a large change is made in each independent dimension of the feature vector. Acceptance of an image with a large modification in a feature dimension implies low relevance of the corresponding feature in the user's perception.

A perceptually large change in average color is achieved by simply bit-flipping (taking the 1's complement of) the segment in the RGB domain. The LUV transformation of the new RGB value becomes the color component of the feature vector of the modified segment. The bit-flipping procedure ensures that the original texture is preserved in the modified segment. If the original query and the color-flipped modified image are both acceptable to the user, the weight of color during retrieval is reduced.

Position modification for the chosen segment consists of displacing it to extreme horizontal and vertical positions within the image. The selected segment is displaced so that it remains entirely within the image. If a user finds such an image acceptable, it implies that segment position has little importance in the user's perception, and consequently the weights of the centroid dimensions are lowered.

Size modification consists of scaling the segment about the segment centroid by a set of fixed factors. For the largest scale factor, the modified segment fills one of the dimensions of the image. If the selected segment already touches either the horizontal or vertical image boundary, no scaling up is possible; in this case, the segment is only scaled down to a fraction of its original size. Different scaling factors can also be used for the $x$- and $y$-dimensions.
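The color modification can be sketched directly (8-bit RGB is assumed; the toy image and mask are illustrative):

```python
import numpy as np

def flip_segment_color(rgb_image, mask):
    """Return a copy of the image with the RGB values inside the
    selected segment bit-flipped (1's complement). The average color
    changes drastically, while pixel-to-pixel variation -- the local
    texture -- is preserved up to sign."""
    out = rgb_image.copy()
    out[mask] = 255 - out[mask]      # 1's complement for 8-bit channels
    return out

img = np.full((2, 2, 3), 200, dtype=np.uint8)
mask = np.array([[True, False], [False, False]])   # selected segment
flipped = flip_segment_color(img, mask)
```

Applying the operation twice restores the original image, which makes the modification cheap to generate and discard at the client.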
Acceptability of size-modified images implies irrelevance of the size dimension for the user. Orientation modification consists of rotating the selected segment about the segment centroid by a fixed angle. If the modified segment is acceptable to the user, the weight of the orientation dimension is lowered during retrieval.

The system assumes orthogonality of the features in order to prune the modification space. Strictly speaking, if the features are integral, then independent modifications may not be sufficient to capture the user's perception accurately. In such a case, additional modified images in which two or more features are simultaneously modified would be needed. Obviously, there is a tradeoff between the amount of user feedback desired for a better understanding of user perception and the user's patience in giving feedback on modified images.

B. Multiple Segment Modification

Typically, segmentation-based CBIR systems treat multiple-segment queries as a simple Boolean AND/OR of the query segments. However, a simple Boolean AND/OR alone is not sufficient to describe multiple segments. Often there is a semantic relationship between the segments of interest, since the segmentation algorithm generally does not segment a semantic object in an image as a single segment. A user may thus select multiple segments to capture his notion of the query object. For example, a color-based segmentation will split a multicolor national flag into multiple segments, thereby destroying the object semantics. The iPURE system performs multiple-segment modifications to estimate the relative importance of the spatial relationships and to perform query expansion, if needed.

Independent modification of the features of each segment would generate a large number of modified images. The concept of a semantic object is employed in iPURE to prune the set of modifications. Frequently, when the user selects multiple segments of interest, these form a semantic object in which the different segments satisfy some spatial constraints. A semantic object represents a group of segments whose relative shape, size, and spatial organization as a whole define the user's notion of the query object. For example, when a user selects the multiple segments of a multicolor flag, generating independent modifications for these segments is not at all efficient for learning the retrieval parameters. Our proposed solution is to infer from the user's feedback whether the multiple segments of interest form 1) a semantic object, where the relative shape, size, and spatial organization of the segments are strictly preserved, e.g., a multicolor flag; 2) multiple objects of interest, e.g., an apple and an orange; or 3) an object and background, e.g., sun and sky.
Thus, when multiple segments are of interest to the user, initial modifications are made to verify the hypothesis that the segments belong to one of these three cases.

1) Manipulation of Segment Contiguity: Let $n$ be the number of segments selected by the user. Define an incidence matrix $A$ of size $n \times n$, where $A_{ij}$ is the number of adjacent pixels between segment $S_i$ and segment $S_j$. Note that $A$ is a symmetric matrix and that $A_{ii}$ is equal to the size of segment $S_i$. A scheme for reducing contiguity between the segments of interest is given as follows.

Step 1) Identify any noncontiguous segment $S_k$, i.e., one such that $A_{kj} = 0$ for all $j \neq k$. Remove the $k$-th row and column of the incidence matrix $A$ and decrement $n$. If $n = 1$, stop.
Step 2) Find the maximally contiguous segment $S_m$. If more than one exists, choose the larger segment; in case of a tie, choose a segment randomly.
Step 3) Compute the convex hull of the remaining segments. Move the maximally contiguous segment outside the convex hull.

Fig. 6. Distorting semantics of object. (a) Segments of interest form a triangular shape in the query image. (b) The maximally contiguous segment moved in the modified image. (c) Selected segments are scaled down about their centroids. (d) Segments scaled down about their contact points with the maximally contiguous segment.

Step 4) Remove the i-th row and column from the matrix A. Decrement N. If N <= 1, stop. Else go to Step 2).
One limitation of the above scheme is that in some cases the maximally contiguous segment cannot be moved outside the convex hull while keeping it entirely within the image. Another approach, guaranteed to reduce contiguity, is to scale down the selected segments about their centroids, as shown in Fig. 6(c). However, scaling does not preserve the initial size of the selected segments. A third alternative is to keep the maximally contiguous segment intact and scale down the other segments about a contact point with the maximally contiguous segment, as shown in Fig. 6(d). This maintains contiguity but distorts the semantic shape of the multiple segments. If the modified images shown in Fig. 6 are not acceptable to the user, a semantic segment is created, and the single-segment modification scheme is applied to the semantic segment. Additionally, a modification is performed to estimate the importance of the shape of the semantic segment for retrieval. The average color of the semantic segment is modified to be the average of the colors of the selected segments. The importance of the semantic shape is increased over the individual colors of the segments if the user is insensitive to this modification. If the original query is also relevant to the user, then query expansion is performed. In case the color-modified image is not acceptable to the user, the importance of individual colors is maintained for retrieval. When the modified image shown in Fig. 6(b) or (c) is acceptable to the user, it is inferred that the selected segments are independent, and the single-segment modification scheme is applied to each of them. The importance of contiguity is reduced during retrieval. The importance of the top-down, left-right relationship is estimated by interchanging the positions of the selected segments.
The importance of enclosure during retrieval is estimated by moving the enclosed segment outside the segment that encloses it. The iPURE system also prunes the modification space by not performing scaling, translation, and rotation for segments that are labeled as background regions using a set of heuristics [2].
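The contiguity-manipulation scheme of Steps 1)-4) can be sketched as follows. This is a minimal illustration under our own assumptions, not the paper's implementation: the convex-hull computation and the actual pixel moves are elided, and the caller supplies a hypothetical target position outside the hull.

```python
import numpy as np

def reduce_contiguity(incidence, target_outside_hull):
    """Sketch of Steps 1)-4): the order in which selected segments are
    moved to break contiguity.

    incidence: symmetric N x N matrix A, where A[i, j] is the number of
    adjacent pixels between segments i and j, and A[i, i] is the size
    of segment i.  Returns a list of (segment, new_position) moves.
    """
    A = np.asarray(incidence, dtype=float)
    alive = list(range(A.shape[0]))
    # Step 1: drop segments not contiguous with any other selection.
    alive = [i for i in alive if any(A[i, j] > 0 for j in alive if j != i)]
    moves = []
    while len(alive) > 1:
        # Step 2: maximally contiguous segment; ties go to the larger one.
        def contiguity(i):
            return sum(A[i, j] for j in alive if j != i)
        m = max(alive, key=lambda i: (contiguity(i), A[i, i]))
        # Step 3: move it outside the convex hull of the remaining
        # segments (hull computation elided in this sketch).
        moves.append((m, target_outside_hull))
        # Step 4: remove the m-th row/column and repeat from Step 2.
        alive.remove(m)
    return moves
```

For three segments where 1 touches both 0 and 2, segment 1 is moved first; the remaining tie is broken by segment size, as in Step 2.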


IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 4, NO. 2, JUNE 2002

V. iPURE INTRA-QUERY LEARNING AND RELEVANCE FEEDBACK
Efficient CBIR systems employ learning techniques based on user feedback to improve retrieval performance. MindReader [17] formulated learning of user perception as an optimization problem: estimate the parameters of the distance metric so as to minimize the sum of distances of the relevant examples from the query. Using the quadratic distance metric

d(x, q) = (x - q)^T M (x - q),   (6)

where M is a real symmetric matrix, the problem is written as

minimize over q and M: sum_i v_i d(x_i, q), subject to det(M) = 1,

where the v_i represent the relevance scores that a user associates with the relevant images x_i. It has been shown that the minimum volume ellipsoid is the optimal solution of the above problem. The optimal parameters are obtained as

q = sum_i v_i x_i / sum_i v_i,   (7)

M = [det(C)]^(1/d) C^(-1),   (8)

C = sum_i v_i (x_i - q)(x_i - q)^T / sum_i v_i,   (9)

where C is the weighted covariance matrix. MARS [29], one of the earliest relevance-feedback-based CBIR systems, assumes feature independence, i.e., a diagonal M, to reduce the similarity metric to a weighted Euclidean distance. Clearly, a MindReader-based approach would require feedback to be provided on a number of images of the order of the number of free parameters in the M matrix. Hence, Rui [28] proposed a hierarchy of complete features that limits the interaction between features. This significantly reduces the number of parameters to be learned, which results in more robust parameter estimates. However, these systems learn only from the images marked as relevant and ignore the information provided by the nonrelevant images; e.g., the MindReader system requires that the relevance scores v_i necessarily be positive, i.e., the examples need to be relevant. Typically, these algorithms converge rapidly without adequate exploration of the feature space, i.e., the number of relevant examples retrieved saturates in a few iterations, while at the same time some nonrelevant images continue to be retrieved in successive iterations. Further, they expect the user to assign a relevance score to each relevant image. We believe that this is similar to asking the user to coherently rank the images, which may be a nontrivial task even for an experienced user. Others, like Nastar [21] and Brunelli [5], incorporate nonrelevant examples along with the relevant examples for learning user perception using nonparametric models.

We propose a new relevance feedback algorithm in the iPURE system that uses both relevant as well as nonrelevant examples to learn user perception. To the best of our knowledge, iPURE is among the few systems that explicitly use nonrelevant examples to estimate the parameters of parametric models. Further, the iPURE system requires the user to provide only "good," i.e., relevant, and "bad," i.e., nonrelevant, labels. The iPURE relevance feedback algorithm computes the relevance scores using the labeled examples to ensure that the new parameter estimates are pushed away from the nonrelevant examples. In the iPURE system, the user provides feedback either on a set of images generated automatically by modifications of the query image at the client side, or on the set of top-k images retrieved from the database. Section V-A describes the algorithm used to learn from the user's feedback on modified query images. Section V-B describes the algorithm used to estimate the similarity metric when the user provides feedback on database images.

A. Intra-Query Learning on Modified Images
The number of relevant images is expected to be small during user feedback on intra-query modifications. Hence, the iPURE system employs a special case of (5) with a diagonal M. Thus, the similarity metric reduces to a weighted Euclidean distance, where the weights w_j denote the relative importance of the j-th feature. The system adapts to capture the variation in the relevant examples along each of the feature dimensions. These weights are updated using sigma_j, the standard deviation of the j-th feature dimension for the relevant segments, and an empirically determined constant. The query point is redefined as the mean of the feature vectors of the relevant segments. These updated weights and query point are used as a starting point to search the database. This quick estimate of the retrieval parameters, even before the database is searched, increases the precision of the images retrieved during the first database search. This is especially the case when different users have different perceptions about the same query; e.g., a user searching for "round" objects may start with a query image that has a "white-round" object in it, as shown in the first image of Fig. 7(a). In the absence of any intra-query modifications, the database search would retrieve all "white-round" objects, since all feature dimensions, i.e., color, shape, size, would be weighed equally. However, when intra-query modifications are used, as shown in Fig. 7(b), and the user finds the color modification acceptable, the system learns the relative unimportance of color. The system reduces the weight of the color features and hence retrieves more "round" objects during the first database search, as shown in Fig. 7(c). The number of "round" objects in the first database search increases to nineteen from thirteen. The improvement in retrieval performance is demonstrated in Table I.
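The weighted-Euclidean re-weighting described above can be sketched as follows. The text states only that the weight of the j-th feature is updated from sigma_j, the standard deviation of that feature over the accepted segments, together with an empirically determined constant; the specific 1/(sigma_j + c) form below is our assumption, chosen so that features that vary across accepted modifications (e.g., color when color flips are accepted) receive low weight.

```python
import numpy as np

def intra_query_weights(relevant_feats, c=1e-3):
    """Estimate diagonal feature weights and a new query point from the
    segments the user accepted among the modified images.

    relevant_feats: (n, d) array of feature vectors.  The 1/(sigma + c)
    update is an assumed form; c stands in for the paper's empirically
    determined constant.
    """
    F = np.asarray(relevant_feats, dtype=float)
    sigma = F.std(axis=0)        # per-feature spread of accepted examples
    w = 1.0 / (sigma + c)        # high spread -> low importance
    q = F.mean(axis=0)           # query point redefined as the mean
    return q, w / w.sum()        # normalized feature weights

def weighted_distance(x, q, w):
    """Weighted Euclidean distance used for the first database search."""
    return float((w * (np.asarray(x, dtype=float) - q) ** 2).sum())
```

If, say, the first feature (color) varies among the accepted modifications while the second (shape) stays constant, the shape weight dominates and "round" objects of any color rank high in the first search.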
Intra-query modification and learning reduces the time required to retrieve 22 relevant objects in the top-25 retrieved images from 0.15 s (three iterations) to 0.09 s (one iteration). When the query image contains multiple segments of interest, the iPURE system learns the notion of a semantic object, as demonstrated when the user searches for "mushrooms" starting from the query image shown in Fig. 8(a). When the user selects the two segments that comprise the mushroom, the iPURE system distorts the object semantics by scaling down the two segments about their centroids, as shown in Fig. 8(b). The system infers that the two segments form a

AGGARWAL et al.: IMAGE RETRIEVAL SYSTEM WITH AUTOMATIC QUERY MODIFICATION


Fig. 8. Semantic object modifications. (a) Query image with two query segments. (b) Contiguity broken to distort object semantics. (c) Color flipped. (d) Color average. (e) Horizontal translation. (f) Vertical translation. (g) Scale up. (h) Scale down. (i) Rotation.

Fig. 7. Retrieval of "white-round" object with and without intra-query learning. (a) Retrieval results without intra-query learning when the user is looking for "roundness" but starts from a "white-round" object. (b) Automatic modifications for the "white-round" segment. (c) More "round" objects retrieved after intra-query modification and learning.

TABLE I. PERFORMANCE COMPARISON OF INTRA-QUERY LEARNING AND RELEVANCE FEEDBACK. (a) NUMBER, N, OF RELEVANT RETRIEVED IMAGES (IN TOP 25) WHEN THE USER IS ACTUALLY INTERESTED IN ROUNDNESS. (b) TIME SPENT DURING RETRIEVAL (IN SEC)

semantic object since the user rejects the modified image. Subsequently, the system generates a set of modified images that includes assigning a uniform color (the average of the two segment colors) to both segments, individual color flips, combined translation in both the x- and y-directions, combined scale-up and scale-down, and combined rotation. Since the user finds both the original and the uniformly gray-colored objects relevant, but not the color-flipped bluish one, a query expansion is performed. The database search is done on both the original two-segment query mushroom and the single-segment gray mushroom. The precision improves substantially, since the system matches the shape of the two-segment semantic object, i.e., the mushroom, to the retrieved segments, and not just two independent segments.
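As an illustration, the pruned set of combined modifications generated for a semantic object (cf. Fig. 8) could be enumerated as below. The data structure and the specific magnitudes (translation offsets, scale factors, rotation angle) are hypothetical placeholders of our own, not values from the paper.

```python
def semantic_object_modifications(segments):
    """Enumerate combined modifications applied to all segments of a
    semantic object jointly, plus per-segment color flips.

    segments: list of dicts with an RGB "color" entry.  Magnitudes
    below are illustrative placeholders, not the paper's settings.
    """
    avg_color = tuple(sum(ch) / len(segments)
                      for ch in zip(*(s["color"] for s in segments)))
    combined = [
        ("color_average", {"color": avg_color}),  # cf. Fig. 8(d)
        ("translate_x", {"dx": 20, "dy": 0}),     # cf. Fig. 8(e)
        ("translate_y", {"dx": 0, "dy": 20}),     # cf. Fig. 8(f)
        ("scale_up", {"factor": 1.25}),           # cf. Fig. 8(g)
        ("scale_down", {"factor": 0.8}),          # cf. Fig. 8(h)
        ("rotate", {"angle_deg": 45}),            # cf. Fig. 8(i)
    ]
    # individual color flips, one per selected segment -- cf. Fig. 8(c)
    flips = [("color_flip_%d" % i, {"segment": i})
             for i in range(len(segments))]
    return combined + flips
```

Treating the segments jointly keeps the modification set linear in the number of selected segments, instead of the combinatorial set that independent per-segment modifications would produce.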

B. Relevance Feedback on Images Retrieved From the Database

In further iterations, when the database has been searched for the top-k matches and the number of training examples is larger, the iPURE system does not place the diagonal restriction on the M matrix in (5). The nonzero off-diagonal elements in M may be robustly estimated as described in [17], [28]. In a heterogeneous collection of images, the features employed to represent the image segments do not accurately capture the visual perception of all users. This results in some relevant segments appearing close to the nonrelevant segments in the feature space, even though the user perceives them as being far apart. To address this limitation, the iPURE algorithm selects a fraction of the relevant examples for estimating the retrieval parameters. Relevant examples that are farther from the nonrelevant examples are chosen over those that are closer to nonrelevant examples. This selection process ensures that the new estimates do not capture the nonrelevant examples. The sampling of relevant examples is achieved by estimating the relevance scores v_i in (7) to ensure that the resulting similarity metric best represents the relevant examples; the details of the algorithm are presented in the next section.
1) Selecting an Optimal Subset of Relevant Examples and Computing Their Relevance Scores: Selection of the right subset of relevant examples to estimate parameters has been explored in the literature; Jolion [19] proposed a random-sampling-based approach to minimum volume ellipsoid estimation for clustering problems. The iPURE system uses a greedy algorithm that selects the examples and explores the feature space by changing the scores v_i associated with the relevant examples. The relevance scores along with the similarity metric parameters are obtained through an iterative process. Points in the feature space at a fixed distance from the query lie on an ellipsoid

210

IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 4, NO. 2, JUNE 2002

Fig. 9. Proposed relevance feedback algorithm for estimating the similarity metric.

owing to the quadratic nature of the distance metric. The algorithm updates the parameter estimates to ensure that the ellipsoid representing the similarity metric better captures the relevant examples while excluding nonrelevant examples. The pseudocode for the algorithm is shown in Fig. 9. The scores v_i are initialized to 1, i.e., all the relevant examples are considered to be equally important. In each iteration, the similarity metric parameters are determined using the current scores v_i. The distances of the relevant and the nonrelevant examples from the target concept are determined using (6). Let x_f denote the farthest relevant example that has a nonzero score, and let d_f be its distance from the query. Let E be the ellipsoid defined by the query point and the metric matrix, having radius d_f. In most cases it is observed that some nonrelevant examples fall inside E. These are the examples that are most likely to be retrieved in the next iteration if the current estimates of the parameters are used. Therefore, the parameters are adjusted to move away from such examples. Let S represent the set of nonrelevant examples inside E. The algorithm modifies the parameters in each iteration to reduce the number of examples in S. This is achieved as follows. The score of the farthest positive example x_f is set to 0. The scores of the other positive examples with nonzero scores are updated as the sum of their quadratic distances from the examples in S. The updating of the scores causes the new query point to move away from the examples in S. Assigning a zero score to the farthest example leads to a shrinking of E, ensuring termination. The updated scores are then used to obtain a new

Fig. 10. Retrieval results for retrieving horse images from the Corel dataset. Segments relevant to the user are marked with a dark gray border; nonrelevant segments are marked with a light gray border. (a) Without intra-query modification, only brown horses are retrieved in the first database search. (b) User feedback on automatic modifications of a horse segment. The user accepts color modifications and translations (marked with a dark gray border) but rejects the rotated horse (marked with a light gray border). (c) With intra-query modification and learning, both white and brown horses are retrieved. (d) Higher precision after one iteration of relevance feedback on the horse query.

estimate of the parameters, and the iteration proceeds. The iteration stops when the number of nonrelevant examples inside the ellipsoid reduces to zero. The final estimate of the parameters is used to rank the images in the database, and the top-k images are shown to the user. In [3], we demonstrate the improved performance of the proposed sampling-based learning algorithm on a two-dimensional synthetic dataset. In the next section we present results for the Corel dataset.
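A condensed sketch of this greedy loop is given below. Here `mindreader_fit` stands in for the parametric metric estimator of (6)-(9) [17]; the small regularization term is our addition to keep the covariance invertible with few examples, and the helper names are ours, not the paper's.

```python
import numpy as np

def mindreader_fit(X, v, eps=1e-9):
    """MindReader-style metric fit, cf. (7)-(9): weighted mean q and
    M = det(C)^(1/d) C^(-1), so that det(M) = 1.  eps regularizes C."""
    X = np.asarray(X, dtype=float)
    v = np.asarray(v, dtype=float)
    q = (v[:, None] * X).sum(axis=0) / v.sum()
    diff = X - q
    C = (v[:, None, None] * diff[:, :, None] * diff[:, None, :]).sum(axis=0) / v.sum()
    C += eps * np.eye(X.shape[1])
    M = np.linalg.det(C) ** (1.0 / X.shape[1]) * np.linalg.inv(C)
    return q, M

def greedy_feedback(rel, nonrel):
    """Greedy score-update loop of Fig. 9 (a sketch).  rel / nonrel are
    (n, d) arrays of relevant / nonrelevant example features."""
    v = np.ones(len(rel))
    while True:
        q, M = mindreader_fit(rel, v)
        dist = lambda x: float((x - q) @ M @ (x - q))
        nz = [i for i in range(len(rel)) if v[i] > 0]
        f = max(nz, key=lambda i: dist(rel[i]))   # farthest scored example
        inside = [y for y in nonrel if dist(y) < dist(rel[f])]
        if not inside:           # no nonrelevant example left inside E
            return q, M
        v[f] = 0.0               # drop the farthest relevant example
        for i in nz:             # re-score survivors by their quadratic
            if i != f:           # distance from the offenders in S
                v[i] = sum(float((rel[i] - y) @ M @ (rel[i] - y))
                           for y in inside)
        if not np.any(v > 0):    # all scores exhausted
            return q, M
```

Each pass zeroes one score, so the loop runs at most n iterations; the surviving relevant examples pull the query point away from the nonrelevant examples trapped inside the current ellipsoid.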

Fig. 11. Precision and recall for horse and eagle categories; with and without intra-query learning of feature weights (IQM stands for Intra Query Modification).

VI. RESULTS

The iPURE system is tested on a database of 2200 Corel stock photographs of size 128 x 192 pixels from varied categories such as sunsets, horses, flowers, mountains, everyday objects, highway signs, beaches, and eagles. The entire database is segmented offline to create a database of 18 547 labeled segments. We present the retrieval performance of the iPURE system in this section and compare the results with those of existing algorithms. In all experiments, the hierarchical similarity model proposed by Rui et al. [28] is used as the distance function for the proposed relevance feedback algorithm. We choose Rui's metric since the number of parameters to be estimated is considerably smaller than that of MindReader [17]; moreover, the segment-level descriptors we have employed naturally fall into distinct uncorrelated groups (e.g., color, shape, position, size). In the absence of intra-query modification and learning, each feature dimension is equally weighed. In such a scenario, when the user presents a centered brown "horse" segment as the query, the system retrieves only brown colored segments centered in the image, as shown in the top-18 retrieval results of Fig. 10(a). However, a user looking for "horses" would find white horse segments, or horse segments that are not centered in the image or

of varying size, acceptable. In contrast, if intra-query modification is employed, the system generates a set of modified images, as shown in Fig. 10(b). Since the user finds the color-modified horse acceptable (shown in dark gray), the system learns the relative unimportance of color in user perception and reduces the weights for the color features. The user also accepts the position- and size-modified images, which results in a reduction of the weights for the position and size features. Thus, in response to such user feedback on the set of intra-query modifications shown in Fig. 10(b), the weights for the color, position, and size feature components are reduced. However, the weight for the orientation feature remains unchanged, though its effective importance increases as the other feature weights have been reduced. Hence, after feature re-weighting, the system retrieves both brown as well as white colored horse segments with moderate variations in position and size as compared to the query segment, as shown in Fig. 10(c). Further, the number of nonrelevant images reduces to just one after a single iteration of the iPURE relevance feedback algorithm, as shown in Fig. 10(d). Precision and recall values have been used in the literature to measure the performance of image retrieval systems. Recall is the ratio of the number of relevant images returned to the total number of relevant images in the database. Precision is defined

Fig. 12. Number of images retrieved in successive iterations of relevance feedback (PRF represents Proposed Relevance Feedback algorithm, IQM stands for Intra Query Modification).

Fig. 13. Precision and Recall after one iteration of relevance feedback (PRF represents Proposed Relevance Feedback algorithm, IQM stands for Intra Query Modification).
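Precision and recall, the measures plotted in Figs. 11-13 and defined in the surrounding text, amount to the following minimal sketch:

```python
def precision_recall(retrieved, relevant_set, total_relevant):
    """Precision: fraction of returned images that are relevant.
    Recall: fraction of all relevant database images that are returned.

    retrieved: list of returned image ids; relevant_set: ids judged
    relevant by the user; total_relevant: number of relevant images
    in the whole database.
    """
    hits = sum(1 for img in retrieved if img in relevant_set)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / total_relevant if total_relevant else 0.0
    return precision, recall
```

For a top-5 result containing two of three relevant database images, precision is 2/5 and recall is 2/3; the curves in Fig. 11 trace these quantities as more images are retrieved.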

as the ratio of the number of relevant images returned to the total number of images returned. The precision and recall performance of the iPURE system for two classes of natural images, horses and eagles, is shown in Fig. 11. Clearly, there is a significant improvement in both precision as well as recall when intra-query modification and learning is employed. The iPURE system further improves the retrieval performance in successive iterations of relevance feedback as compared to existing algorithms. The number of relevant images in the top-50 retrieved images for queries from four different categories, including horses, eagles, and highway signs, is shown in Fig. 12 for successive iterations. The iPURE algorithm "PRF with IQM" uses feature re-weighting through intra-query modification and learning in the zeroth iteration and the proposed

relevance feedback algorithm from the first iteration onwards. "PRF without IQM" uses only the proposed relevance feedback algorithm, without intra-query modification. "Rui" implements the similarity model proposed in [28] and uses only relevant examples. "MARS" assumes feature independence and estimates the feature weights from the feature variances of the relevant examples [29]. These results demonstrate the effectiveness of the proposed intra-query modifications and of the explicit use of nonrelevant examples during relevance feedback. We further illustrate the benefits by comparing the precision and recall of the various algorithms after one relevance feedback iteration for the four category queries in Fig. 13. In particular, intra-query learning improves the recall in the initial iteration, and after that the sampling-based learning algorithm further improves the

AGGARWAL et al.: IMAGE RETRIEVAL SYSTEM WITH AUTOMATIC QUERY MODIFICATION

precision and recall. Further, the recall does not reach unity for all category queries, even when 500 images are retrieved, since for some category queries our chosen segment descriptors cannot completely capture the high-level visual characteristics.

VII. CONCLUSION

A new CBIR system that addresses the subjectivity of user perception during image retrieval is proposed, and its effectiveness is experimentally demonstrated in this paper. Our approach fundamentally differs from the traditional approaches in that it employs a methodology of (client-side) intra-query modification and learning to reduce the need for the traditional compute- and bandwidth-intensive relevance feedback mechanism. In addition, intra-query modification and learning provides a more effective mechanism for initializing the similarity metric before searching the image database. This methodology can readily be extended to other "query-by-example" based multimedia search scenarios. The proposed system is built around image segmentation to enable "object-level" search in the image. We have developed a reasonably accurate and fast color segmentation technique that leverages the strengths of region-based and edge-based segmentation. Also, a new parametric relevance feedback algorithm is proposed that explicitly utilizes information about nonrelevant examples. The modification strategies described in this paper are based on simple heuristics. More effective modification schemes may be developed if the feature distribution in the image database is taken into account. Further research is also needed to extend the proposed paradigm to global feature-based image retrieval. Note that while fundamentally similar in nature to well-studied classification problems, the interactive "query-by-example" based image retrieval paradigm poses new challenges, since the training points are not representative of the database feature vectors.
Thus, incremental classification with very few training samples and good generalization needs to be addressed in the context of image retrieval.

REFERENCES

[1] G. Aggarwal, P. Dubey, S. Ghosal, A. Kulshreshtha, and A. Sarkar, "iPURE: Perceptual and user-friendly REtrieval of images," Proc. IEEE ICME, Aug. 2000.
[2] G. Aggarwal, S. Ghosal, and P. Dubey, "Efficient query modification for image retrieval," Proc. IEEE CVPR, June 2000.
[3] T. V. Ashwin, N. Jain, and S. Ghosal, "Improving image retrieval performance with negative relevance feedback," Proc. IEEE ICASSP, Mar. 2001.
[4] S. Aksoy and R. Haralick, "Feature normalization and likelihood-based similarity measures for image retrieval," Pattern Recognit. Lett., Special Issue on Image and Video Retrieval, 2000.
[5] R. Brunelli and O. Mich, "Image retrieval by examples," IEEE Trans. Multimedia, vol. 2, pp. 164-171, Sept. 2000.
[6] J. Canny, "A computational approach to edge detection," IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-8, Nov. 1986.
[7] C. Carson, S. Belongie, and H. Greenspan, "Region-based image querying," Proc. IEEE CAIVL, 1997.
[8] D. Comaniciu and P. Meer, "Robust analysis of feature spaces: Color image segmentation," presented at the IEEE CVPR'97, San Juan, PR, 1997.
[9] I. Cox, M. Miller, T. Minka, T. Papathomas, and P. Yianilos, "The Bayesian image retrieval system, PicHunter: Theory, implementation and psychophysical experiments," IEEE Trans. Image Processing, vol. 9, pp. 20-37, Jan. 2000.
[10] A. Del Bimbo, Visual Image Retrieval. San Francisco, CA: Morgan Kaufmann, 1999.
[11] J. Fan, D. K. Y. Yau, A. K. Elmagarmid, and W. G. Aref, "Automatic image segmentation by integrating color-edge extraction and seeded region growing," IEEE Trans. Image Processing, vol. 10, pp. 1454-1466, Oct. 2001.
[12] M. Flickner, H. Sawhney, and W. Niblack, "Query by image and video content: The QBIC system," IEEE Computer, vol. 28, no. 9, 1995.
[13] S. Ghosal, J. Mandel, and R. Tezaur, "Automatic substructuring for domain decomposition using neural networks," Proc. IEEE ICNN, June 1994.
[14] B. Gunsel and A. M. Tekalp, "Shape similarity matching for query-by-example," Pattern Recognit., vol. 31, no. 7, pp. 931-944, 1998.
[15] P. Hong, Q. Tian, and T. Huang, "Incorporate support vector machines to content-based image retrieval with relevance feedback," Proc. IEEE ICIP, 2000.
[16] J. Hopfield and D. Tank, "Neural computation of decisions in optimization problems," Biol. Cybern., vol. 52, 1985.
[17] Y. Ishikawa, R. Subramanya, and C. Faloutsos, "MindReader: Querying databases through multiple examples," Proc. VLDB, Aug. 1998.
[18] A. K. Jain, Fundamentals of Digital Image Processing. Englewood Cliffs, NJ: Prentice-Hall, 1997.
[19] J. Jolion, P. Meer, and S. Bataouche, "Robust clustering with applications in computer vision," IEEE Trans. Pattern Anal. Machine Intell., vol. 13, pp. 791-802, Aug. 1991.
[20] M. Ma and B. Manjunath, "NETRA: A toolbox for navigating large image databases," Multimedia Syst., vol. 7, no. 3, 1999.
[21] C. Meilhac and C. Nastar, "Relevance feedback and category search in image databases," Proc. IEEE Multimedia Computing and Systems, June 1999.
[22] T. P. Minka and R. Picard, "Interactive learning using a society of models," Pattern Recognit., vol. 30, no. 4, 1997.
[23] C. Nastar, M. Mitschke, and C. Meilhac, "Efficient query refinement for image retrieval," Proc. IEEE CVPR, 1998.
[24] N. Pal and S. Pal, "A review of image segmentation techniques," Pattern Recognit., vol. 26, pp. 1277-1294, 1993.
[25] T. Pavlidis and Y. T. Liow, "Integrating region growing and edge detection," IEEE Trans. Pattern Anal. Machine Intell., vol. 12, pp. 225-233, Mar. 1990.
[26] A. P. Pentland, R. W. Picard, and S. Sclaroff, "Photobook: Content-based manipulation of image databases," Int. J. Comput. Vis., vol. 18, no. 3, pp. 233-254, 1996.
[27] A. Ramamurthy and S. Ghosal, "An integrated segmentation technique for interactive image retrieval," Proc. IEEE ICIP, 2000.
[28] Y. Rui and T. Huang, "Optimizing learning in image retrieval," Proc. IEEE CVPR, June 2000.
[29] Y. Rui, T. Huang, M. Ortega, and S. Mehrotra, "Relevance feedback: A power tool for interactive content-based image retrieval," IEEE Trans. Circuits Syst. Video Technol., vol. 8, pp. 644-655, Sept. 1998.
[30] E. Saber, A. M. Tekalp, and G. Bozdagi, "Fusion of color and edge information for improved segmentation and edge linking," J. Image Vis. Comput., vol. 15, no. 10, pp. 769-780, 1997.
[31] E. Saber and A. M. Tekalp, "Region-based affine shape matching for automatic image annotation and query-by-example," Vis. Commun. Image Represent., Mar. 1997.
[32] S. Sclaroff, L. Taycher, and M. LaCascia, "ImageRover: A content-based image browser for the worldwide web," Proc. IEEE CAIVL, 1997.
[33] W. Skarbek and A. Koschan, "Color image segmentation: A survey," Dept. Comput. Sci., Tech. Univ. Berlin, Germany, Tech. Rep. 94-32, Oct. 1994.
[34] J. Smith and S. Chang, "VisualSEEk: A fully automated content-based image query system," presented at Proc. ACM MM, MA, 1996.

Gaurav Aggarwal received the B.Tech. degree in computer science and engineering from Indian Institute of Technology, Delhi, India, in 1996 and the M.S. degree in information and computer science from the University of California, Irvine, in 1998. He was a Research Staff Member at IBM India Research Laboratory, New Delhi, during 1998–2001. His research interests include image processing, digital video standards, and content-based retrieval systems. He is currently a Senior Design Engineer at Broadcom India, Bangalore, working on various multimedia standards.


Ashwin T. V. received the B.E. degree from Karnataka Regional Engineering College, Surathkal, India, in 1998, and the M.S. degree in system science and automation from the Indian Institute of Science, Bangalore, India, in 2000. Since 2000, he has been a Research Staff Member at the IBM India Research Laboratory, New Delhi, where he has worked on image processing tasks such as crop area determination and mosaicing of remote sensing images. Currently he is pursuing content-based retrieval techniques for IBM products. His research interests include relevance feedback for content-based information retrieval, multimedia databases, and pattern classification.


Sugata Ghosal received the B.E. degree in electronics and telecommunication engineering from Jadavpur University, Calcutta, India, in 1988 and the Ph.D. degree from the University of Kentucky, Lexington, in 1993. He is a Research Staff Member and a Manager at IBM India Research Laboratory in New Delhi. He conducts research in image analysis and multimedia information systems. Prior to joining IBM, he was a Researcher at the Algorithm Research Center of Sony Electronics at San Jose, CA, and a principal investigator in several U.S. DoD sponsored foveal vision projects at Amherst Systems (currently a subsidiary of Northrop Grumman, Inc.), Buffalo, NY. He has published over 35 papers and holds over 15 U.S. patents in error-resilient video compression, coding, and reconstruction.