
A search through visual media should be as imprecise as ‘I know it when I see it.’

In Search of Information in Visual Media

Amarnath Gupta, Simone Santini, and Ramesh Jain

Information is data with semantic association. An information system's most important role is to capture the data and its semantic associations so users can perform meaningful tasks with the information. An information system's design and functionality change along with changes in the nature of the data,

the nature of data association, and the task the user is trying to perform. For example, temporal interval data involves different semantics from those of a row of floating-point numbers. Similarly, users of On-Line Analytical Processing database software need a different set of operations from those used in relational systems, although the underlying data might be the same.

In a perfect world, given any kind of data, semantics, and operational requirement, a suitable information system can be designed and built. Unfortunately, that is not reality. Reality is that information systems are most successful when the data has human-imposed structure. Systems derived from the relational model, for example, provide users a powerful set of tools to specify the domain of every attribute and the semantic associations between them. An application schema created through these tools constrains the data to fit the problem domain. The resulting database is semantically rich because the system's developer takes the trouble to ensure that every attribute of the model has a well-defined interpretation and that the dependencies between attributes faithfully reflect the real problem world.

A successful information system gives the user enough capability to define the attribute domains, express data associations, and perform an adequate set of retrieval operations. A great case in point is the practical success of spatial information systems, in which the data types are points, lines, and regions in space and operations can be as complex as finding the intersection of two arbitrary polygons in 3D space.
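Such spatial operations are easy to underestimate. As a concrete illustration (ours, not drawn from any particular spatial system, and simplified from arbitrary polygons in 3D to convex polygons in 2D), the classic Sutherland-Hodgman algorithm computes a polygon intersection in a few dozen lines:

```python
# A minimal sketch of one classic spatial operation: intersecting two
# convex polygons via Sutherland-Hodgman clipping. Illustrative only.

def clip_polygon(subject, clip):
    """Intersect convex polygon `subject` with convex polygon `clip`.
    Polygons are lists of (x, y) vertices in counterclockwise order."""
    def inside(p, a, b):
        # True when p lies left of (or on) the directed edge a->b.
        return (b[0]-a[0])*(p[1]-a[1]) - (b[1]-a[1])*(p[0]-a[0]) >= 0

    def crossing(p, q, a, b):
        # Intersection of segment p-q with the infinite line through a-b.
        # Only called when p and q straddle the line, so den is nonzero.
        den = (p[0]-q[0])*(a[1]-b[1]) - (p[1]-q[1])*(a[0]-b[0])
        t = ((p[0]-a[0])*(a[1]-b[1]) - (p[1]-a[1])*(a[0]-b[0])) / den
        return (p[0] + t*(q[0]-p[0]), p[1] + t*(q[1]-p[1]))

    output = subject
    for i in range(len(clip)):
        a, b = clip[i], clip[(i+1) % len(clip)]
        input_list, output = output, []
        for j in range(len(input_list)):
            p, q = input_list[j], input_list[(j+1) % len(input_list)]
            if inside(q, a, b):
                if not inside(p, a, b):
                    output.append(crossing(p, q, a, b))
                output.append(q)
            elif inside(p, a, b):
                output.append(crossing(p, q, a, b))
    return output

# Two overlapping squares intersect in the unit square from (1,1) to (2,2):
print(clip_polygon([(0,0),(2,0),(2,2),(0,2)], [(1,1),(3,1),(3,3),(1,3)]))
```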

A second area of moderate success has been information systems in which the data has little structure but the associations are semantically rich. A modern example is the world of hyperlinks represented by the phenomenal application known as the World-Wide Web. With hyperlinks between entities covering such a wide and heterogeneous spectrum as text, documents, images, movie clips, audio files, virtual reality, and databases, the Web has revolutionized our ability to access unstructured information.


Figure 1. The larger picture is an airplane. The picture in the inset has the same background texture but with two + shapes. Two similar shapes are enough for many systems to consider the two pictures very similar.

With all its success, however, can the Web be called a good information system? We can say it is a good application when users want to browse by navigating hyperlinks or have prior navigational patterns that take them close to the right information. But trying to locate specific but unknown information via search can be a nightmare.

Compared with database and hypertext-based systems, free-text retrieval systems have had mixed success. Most common text-information retrieval systems lack the ability to interpret a user's intent and instead try to approximate intent through statistical techniques, like word frequency and term co-occurrence. Some systems go a step deeper, trying to group a corpus of terms through interterm association (such as through a thesaurus relating dog to canine and mongrel or through a latent semantic association relating Macbeth to Shakespeare). These systems perform some degree of associative search, so a query on mongrel matches documents with the words dog and canine. However, only a few systems go beyond word structure and occurrence statistics to use natural language processing techniques to extract phrase-level, sentence-level, and intersentence-level associations and use these associations for retrieval.
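To make the mechanism concrete, here is a toy sketch of associative search via thesaurus expansion over an inverted index; the thesaurus and corpus are our own invented examples, not any system described above:

```python
# Associative text search: a query on "mongrel" also matches documents
# containing its thesaurus associates "dog" and "canine".
from collections import defaultdict

THESAURUS = {"mongrel": {"dog", "canine"}, "dog": {"canine", "mongrel"}}

docs = {
    1: "the dog chased the ball",
    2: "a canine unit patrolled the airport",
    3: "macbeth is a play by shakespeare",
}

# Build an inverted index: term -> set of document ids.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def associative_search(term):
    """Return ids of documents matching the term or any of its associates."""
    terms = {term} | THESAURUS.get(term, set())
    hits = set()
    for t in terms:
        hits |= index[t]
    return sorted(hits)

print(associative_search("mongrel"))  # -> [1, 2]
```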


Hence, the semantics expressed through intersentence or interparagraph associations cannot yet be modeled and retrieved. Given the absence of structure in free text, the task of the system is not only to store and retrieve associations with data but to extract associations from raw text data, thereby generating information. This association extraction is computationally more difficult and often a nearly impossible task. However, if an information retrieval system had the ideal extraction component, it could interpret a discourse written over multiple paragraphs and answer semantic queries like, "What is the basic plot of the story?" So far, only a few limited-domain research prototypes have such ability. The issue here is that an information retrieval system that also needs to extract information from raw data (as opposed to being told what the associations are) is inherently weaker, retrieving information suboptimally. This problem has much in common with visual information systems, in which the system contains images (or videos) as its primary information object.

The Character of Visual Information
As in all these systems, the information in a visual object is its value, that is, a depiction of what it contains, and its association with other visual objects.

However, for visual objects, characterizing information content becomes more complex than for text. In text, every word has a finite number of meanings, and the correct semantic value of a word, if not immediately clear, needs to be disambiguated (chosen from these finite possibilities) through sentence-level or paragraph-level analysis.

In Figure 1, the semantic value of the visual object is obviously a flying aircraft. Visually, however, the picture is quite similar to the small inset image in the upper right-hand corner. This seeming equivalence of the two visual objects comes about because they have a similar appearance value, characterized by a combination of their color and texture. If we query on visual object 1, should the system retrieve visual object 2? The problem would be trivially simple if the system could label visual object 1 as an aircraft, because then we could match visual objects by their semantic value alone. Extracting the semantic value from the visual appearance is an arduous task, complicated by the facts that many objects with the same semantic label exhibit a very large variety of appearance values and that automatic isolation (segmentation) of the semantic objects from an image is in itself a difficult problem. In this article, we focus mainly on retrieval based on appearance values of visual objects. For an object-modeling perspective of the retrieval problem, see [4].

Retrieval based on visual object appearance involves four categories of information items: features; feature space; feature groups; and image space.

Features. Computationally, a feature is a derived attribute obtained by transforming the original visual object through an image analysis algorithm; it characterizes a specific property of an image. A feature is typically represented as a set of numbers, often called a feature vector, although several vector operations, like addition and multiplication by a constant, are never performed on them. The operations used most often are:

• Projection. Projection creates a lower-dimensional vector from a higher-dimensional vector by choosing a user-specified set of dimensions. For example, if the feature shape of a visual object contains 10 floating-point numbers representing Zernike moments, choosing the first five of them is a projection. Features like Zernike moments are frequently ordered arrays, so choosing the first five to match objects retrieves objects that match in their overall shape but may differ in details.

• Apply Function. Apply Function takes a feature as input and applies a function F to all numbers to create another set of numbers with the same dimension. For example, a filter function (see Figure 2) may be applied to the hue histogram of an image to compute the redness of the image.

• Distance. Given two features, Distance computes a difference value between them, a crucial operation because for many kinds of visual information, a match in the appearance value is defined in terms of the Distance function of the constituent features; the greater the distance, the less the match. The design of the Distance function often accounts for the inherent imprecision in matching two visual objects. One predominant effect of treating features as vectors is that the Distance function is often defined as a Euclidean, or city-block, distance between two points in feature space. The cosine (or angular) distance between two features is another vector-space measure very popular in text retrieval, but it has been used less for visual information.

Figure 2. Hue is the color content of a pixel, represented with an angular scale from 0 to 360 degrees. The perceptual color red appears around 0 degrees, green at 120 degrees, and blue at 240 degrees. The curve shows a filter that can be used to compute the redness of an image.

However, features cannot always be thought of as vectors. For example, if the feature represents a distribution (or histogram) of a variable measured in the image, the distance function needs to compute the difference between two distributions. A classic distance measure for comparing distributions is the Mahalanobis distance, widely used in statistical pattern recognition. Rubner, Guibas, and Tomasi [10] recently used a color distance function called the "earth mover's distance" to compute the work involved in converting one distribution to another distribution. The features of a visual object can also be represented as groups of points in the feature space.
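These operations are simple to state computationally. The sketch below is our own illustrative code, treating a feature as a plain list of floats; it shows Projection, Apply Function, and common Distance forms, plus the special case of two one-dimensional histograms of equal mass, where the earth mover's distance reduces to the area between the cumulative distributions (the general multi-dimensional EMD of [10] is considerably more involved):

```python
import math

def projection(feature, k):
    """Keep the first k dimensions, e.g., 5 of 10 Zernike moments."""
    return feature[:k]

def apply_function(feature, f):
    """Apply f elementwise, e.g., a redness filter over a hue histogram."""
    return [f(x) for x in feature]

def euclidean(f1, f2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f1, f2)))

def city_block(f1, f2):
    return sum(abs(a - b) for a, b in zip(f1, f2))

def cosine_distance(f1, f2):
    dot = sum(a * b for a, b in zip(f1, f2))
    norm = math.sqrt(sum(a * a for a in f1)) * math.sqrt(sum(b * b for b in f2))
    return 1.0 - dot / norm

def emd_1d(h1, h2):
    """1D earth mover's distance for equal-mass histograms: the total
    absolute difference of the two cumulative distributions."""
    cum, work = 0.0, 0.0
    for a, b in zip(h1, h2):
        cum += a - b
        work += abs(cum)
    return work
```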

To illustrate, assume a feature called "texture" of an image region can be represented by three numbers: randomness (a chessboard has little randomness, while an arbitrary set of dots has a lot); periodicity (repetitiveness of pattern); and directionality (the stripes in the American flag have an oriented texture) [7]. Consider an image with 10 different regions, each with a different texture value. These texture values form 10 points in the randomness-periodicity-directionality coordinates. How similar is this image to another that has 10 other textured regions? Many distance functions are defined between point sets. Eiter and Mannila recently compared a set of such distance functions [3], and many other kinds of distance functions have also been reported in the literature. How well do these distance functions portray the human sense of difference in appearance? We have seen no thorough investigation of the issue to date.
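As one concrete instance of a point-set distance (a classic choice among those compared in [3], not necessarily one that is perceptually faithful), the Hausdorff distance can compare the two sets of texture points:

```python
# Hausdorff distance between two point sets; a minimal sketch for the
# texture example, where each region is one point in the
# (randomness, periodicity, directionality) coordinates.
import math

def _dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def hausdorff(set_a, set_b):
    """Small only when every region in one image has a closely matching
    textured region in the other, and vice versa."""
    d_ab = max(min(_dist(p, q) for q in set_b) for p in set_a)
    d_ba = max(min(_dist(p, q) for q in set_a) for p in set_b)
    return max(d_ab, d_ba)

# Two images, each reduced to a couple of textured regions:
img1 = [(0.1, 0.8, 0.9), (0.7, 0.2, 0.1)]
img2 = [(0.15, 0.75, 0.85), (0.9, 0.1, 0.0)]
print(hausdorff(img1, img2))
```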

Feature space. Regardless of whether or not it is a vector, the feature for a visual object inhabits some region in a space defined by its variables. In the texture example, we saw that the feature space is 3D. As we add more images to a database, the 3D space gets populated with a point for every new textured region. If we treat this space as an information object, we can query it by using a number of operations. Querying the feature space gives us not only a feeling for which objects are in a specific part of the feature space but a general impression of what the database contains, an important class of queries for browsing and exploring the database. The most obvious operations are union, difference, and set-membership of point-sets in feature space. More involved operations include:

• Find Boundary. Given a set of points in feature space, Find Boundary returns a hyperpolyhedral boundary of the points. It can be used for exploratory queries like, "Here are 10 examples of the American flag. Show me the part of the feature space covering all 10 instances." Once we get back the region, we can ask further queries, like, "What other visual objects belong to this area?"

• Select by Spatial Constraint. In its simplest form, Select by Spatial Constraint is a range query that retrieves all feature points lying within (or outside) a hypercube in the feature space; in a 2D feature space, a hypercube is a rectangle whose bounds have been specified. In a more general form, the range may be specified by using



constraints on the boundary of the region (e.g., by "drawing" the boundary of the region with lines and curves). This operation, although commonly used in spatial databases, can be computationally prohibitive for feature spaces with a large number of dimensions or variables.

• Select by Distance. Although discussed here because of its popularity, Select by Distance is a special form of the previous query, whereby the user picks a query feature point in space and the range is always the shape of a hyperellipsoid (an ellipse in 2D) around the point selected. The lengths of the different axes (dimensions or variables) of the hyperellipsoid are determined by how tightly the user wants a result to match the query feature point along an axis; the tighter the match, the shorter the axis.

• k-Nearest Neighbor. This is the most popular query supported by current query-by-example systems. As with the previous query, the user selects a query feature point.

The operation finds the k visual objects, denoted by the other points in the feature space, closest in distance to the query feature point, often ranking them in order of distance (counting and ordering are two important operations in feature space). While this is a very useful query in itself, it may turn out to be less useful if the feature space is sparsely populated, because then the nearest neighbor can be very far away in the feature space and very dissimilar to the example in appearance value.

• Partition Space. Related to the classic transitive closure computation, this operation is used to divide the feature space into a small number of regions (called clusters in pattern recognition) by gathering close-by feature points into a region. Parts of the feature space with many nearby points form a dense group, while sparse regions may have only a few points per group. There are several methods for creating the partition; in one, the user provides two distance bounds d and d*, specifying that all members within a group must have mutual distance less than d and that the distance between two groups has to be at least d* (see the colored clusters in Figure 3).

• Rename. Rename assigns a new name to a specific part of the feature space. For example, in the 3D texture feature space, the region high in both orientation and periodicity values may be renamed "stripes." In a more general case, a Boolean combination of regions (Region A and Region B) can be renamed as a single entity.

• Aggregate Operations. These operations, such as count, mean, standard deviation, and cluster diameter, compute aggregate properties of the feature space.

Clearly, users can compose more meaningful queries by combining these operations and a few programming constructs. For example, an advanced user can express the query, "Find the k-Nearest Neighbors of the query feature point within its cluster," by saying, "Partition the feature space into a set of regions called R. Find the boundary of the k-Nearest Neighbors of the query feature point, and call the bounded region S. Compute R ∩ S."
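A minimal sketch of how such a composition might look in code, under our own simplified definitions (plain Euclidean distance, a single-linkage partition that uses only the d bound rather than the stricter d/d* criterion above, and invented function names):

```python
import math

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def k_nearest(points, query, k):
    """k-Nearest Neighbor: the k points closest to the query, ranked."""
    return sorted(points, key=lambda p: dist(p, query))[:k]

def select_by_distance(points, query, axis_lengths):
    """Points inside a hyperellipsoid around the query; a shorter axis
    demands a tighter match along that dimension."""
    return [p for p in points
            if sum(((a - b) / r) ** 2
                   for a, b, r in zip(p, query, axis_lengths)) <= 1.0]

def partition(points, d):
    """Group points whose chains of mutual distances stay below d."""
    clusters = []
    for p in points:
        touching = [c for c in clusters if any(dist(p, q) < d for q in c)]
        for c in touching:
            clusters.remove(c)
        clusters.append(sum(touching, []) + [p])   # merge and add p
    return clusters

def knn_within_cluster(points, query, k, d):
    """Compose the operations: R = the query's cluster, S = the k-NN set,
    answer = R intersected with S."""
    R = next(c for c in partition(points + [query], d) if query in c)
    S = k_nearest(points, query, k)
    return [p for p in S if p in R]
```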

Feature Groups
Grouping more than one simple feature into one complex feature often makes the complex feature more expressive.

Figure 3. Detailed view of the area found to contain cats on the x-axis. We are actually in a “cat-rich” area, although there are many other images.

In [4], the authors refer to the combination of a skin-color detector (a renaming of a transformation on the hue-saturation-luminance space) and cylindrical geometric features (a range selection in the shape feature space of all closed objects detected in images) to detect bare-skinned human figures. In a more general framework, feature groups can be created by at least two kinds of operations:

• Distance Aggregation. Suppose there are two features, such as color and texture, with their own feature spaces and distance functions. Let the color distance between two visual objects be d1 and the texture distance between them be d2. We can then compute a combined distance D between them through a combination function F1, which may be as simple as a weighted sum. Thus, D = F1(d1, d2). Taking this one step further, we can compose an aggregate hierarchy of distances of the form D = F3(d4, F2(d3, F1(d1, d2))). Even when the functions are all implemented as weighted sums, Distance Aggregation allows different feature combinations by allowing the user to adjust weights [6].

• Dimension Joining. If we want to create a domain-specific feature from a set of primitive features, such as a red-and-white-stripe detector (for the American flag) or a bare-skinned-human detector, we need a more sophisticated tool than Distance Aggregation. Dimension Joining is the


construction of a new feature axis (colored stripe) by a natural join-style operation between chosen feature axes (high periodicity and hue). However, we also need to specify a distance function for the newly computed feature space.
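Distance Aggregation, at least in its weighted-sum form, is straightforward to sketch. In the illustrative code below, the weights and the assignment of d1 through d4 to particular features are our own assumptions:

```python
# Distance Aggregation: per-feature distances combined by weighted sums,
# nested into a hierarchy of the form D = F3(d4, F2(d3, F1(d1, d2))).

def weighted_sum(w):
    """Return a combination function F(a, b) = w*a + (1 - w)*b."""
    return lambda a, b: w * a + (1 - w) * b

F1 = weighted_sum(0.7)   # e.g., color vs. texture
F2 = weighted_sum(0.5)   # e.g., shape vs. the color/texture combination
F3 = weighted_sum(0.4)   # e.g., position vs. everything else

def aggregate_distance(d1, d2, d3, d4):
    # d1..d4 might be color, texture, shape, and position distances.
    return F3(d4, F2(d3, F1(d1, d2)))

print(aggregate_distance(0.2, 0.6, 0.1, 0.9))  # -> 0.486
```

Re-weighting the combination, as in [6], amounts to nothing more than changing w, which is what makes the scheme attractive for interactive query refinement.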

Image Space
If we ask, "Show images with more than 20% blue on the top, more than 40% green in the bottom section, and a red-pink area in the middle," we may be looking for an outdoor picture showing a flower in a garden. This example shows how a user may describe the compositional elements of the desired image and their layout in the image space. Understandably, much image-database research has been devoted to methods for using locations, sizes, and arrangements of the image's compositional elements for retrieval. Such queries are allowed through three kinds of systems:

Those with explicit spatial data structures and operators. There has been strong interest from the database community in applying spatial database techniques to queries on the image space. These systems have developed either a query language with spatial operators (such as PICQUERY+ [1], which allows query conditions like SIZE = short AND INTERSECTS_WITH = wrist) or a first-order logic scheme (similar to relational calculus) with additional syntax and semantics for spatial operations. Del Bimbo et al. [2] developed a logic with region-based expressions to specify positional information and object-based expressions for specifying interobject relationships. The flexibility these systems offer in formulating visual queries works well for constrained domains but, in our experience, is often offset by a general lack of control over the set of images present in the database. For example, with these systems, it is difficult to express queries like, "Find images with scattered white regions on a blue background at the top of the image." Database researchers need to extend the scope of their well-defined domains to accommodate the imprecision inherent in the world of images.

Those using fixed regions and implicit spatial functions. These systems divide the image space into a number of predetermined regions. These regions can simply be 8 × 8 pixel blocks. A system by Stricker and Dimai [11] uses a central oval region and four image corners. The user needs to specify which regions are important for spatial matching. The system computes image features for each specified region and evaluates candidate images by


computing region-wise similarity. Once the similarity for each region is evaluated, the system evaluates a composite function to combine the region-wise similarities into a composite image-level similarity. Stricker and Dimai's system, for example, uses a fuzzy function that puts more weight on the central oval region; the weight progressively diminishes away from the center. Similarities from the corners are added as a weighted function. Their system is also insensitive to 90-degree rotations. Such geometric transformation is an important issue all image-space systems have to address: Should a translated, rotated, scaled, or otherwise geometrically transformed version of an image be considered a near-perfect match with the original? The answer is strictly domain dependent; what is acceptable for a stock photo house may not be for a satellite image analyst. However, a system that handles image-space queries is incomplete unless the system designer specifies the types of invariance allowed for geometrically transformed images [5].

Those using segmented regions and implicit spatial functions. We are beginning to see more systems that attempt to segment an image to extract its constituent regions. For example, in the NETRA (in Sanskrit, netra means "eye") system [8], developed at the University of California at Santa Barbara, images are segmented into homogeneous regions. Users can compose such queries as, "Retrieve all images containing regions that have the color of object A, the texture of object B, and the shape of object C and lie in the upper third of the image," whereby the individual objects could be regions belonging to different images. Ideally, these regions could be treated as spatial objects with location, shape, size, and neighborhood properties, supporting queries containing spatial and topological predicates. Realistically, however, segmentation is imperfect, leading to errors in the computation of spatial properties. This imperfection makes even simple operations, like adjacency testing, more complex than anything current spatial systems have been used for. So far, we have not seen much research on the relation between the approximations introduced by segmentation algorithms and the performance of spatial queries.
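To make the fixed-region idea concrete, here is a rough sketch (our own, with assumed hue ranges standing in for "blue" and "green") of the garden-picture query from the start of this section, using horizontal thirds as the predetermined regions:

```python
# "More than 20% blue on top and more than 40% green at the bottom,"
# evaluated over fixed horizontal thirds with a per-pixel hue test.

def fraction_in_hue_range(pixels, lo, hi):
    """Fraction of hue values (degrees) falling in [lo, hi)."""
    if not pixels:
        return 0.0
    return sum(lo <= h < hi for h in pixels) / len(pixels)

def matches_layout(image_rows):
    """image_rows: list of pixel rows, each a list of hue values in degrees."""
    third = max(1, len(image_rows) // 3)
    top = [h for row in image_rows[:third] for h in row]
    bottom = [h for row in image_rows[-third:] for h in row]
    blue_on_top = fraction_in_hue_range(top, 200, 260) > 0.20
    green_below = fraction_in_hue_range(bottom, 90, 150) > 0.40
    return blue_on_top and green_below
```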

A Search Environment for Visual Information
Although a great deal of information is computable from both single images and image collections, having extractable information is never enough.

Figure 4. Global view of the database. The three axes correspond to color, color distribution, and structure. The axes display a sample of the images taken randomly at several points along the axis.

Information is meaningful only when it can be retrieved through an expressive query. Many recent systems (such as NETRA, as described by Ma [8]) have moved from computing features over whole images to computing features for each segmented region. A system developed at the University of Massachusetts, Amherst, [9] can express spatial relationships between segmented objects. For most such systems, however, the primary mode of search is still a k-Nearest Neighbor query over region objects. To date, no system offers a search environment to express queries that involve all the information types described in the previous section.

In practice, however, similarity queries work fairly well in many applications. The reason is that similarity queries in most of these systems are based on simultaneous comparison of many different image attributes. For example, color itself has three variables, one can define at least 12 different texture measures based on an image's gray-level co-occurrence matrix, and there are many more variables for shape-like and position-sensitive properties. Through the simple laws of joint probability of these many AND-ed variables, it is only natural that the results of a k-Nearest Neighbor query match the query image very closely.

We have also seen many cases in which users issue similarity queries with a particular intent, get back unexpected results, and discover the result set, though different from their personal intent, is quite consistent with the original query.

At the University of California at San Diego, we have sought to model how users search through a large collection of images, particularly when they have no prior knowledge of the collection's content and have an imprecisely formed query in mind. We hold that such users go through a number of explore-navigate-select-refine cycles before identifying the objects of interest.

Users start the exploration phase by first looking at the distribution of images in a virtual space whose axes are features they themselves select. They usually opt to see sample images from any part of the virtual feature space (see Figure 4). Alternatively, users could start by querying on an example image. In such cases, the distribution is recomputed in terms of the feature distance from the referent image. As users turn the feature cube around or walk through it, they become aware that although there is a wide variation along the color axis, it is the color distribution axis that shows meaningful clusters. In this case, there are several cat images on the x-axis.

Once users develop a basic feel for the organization or content, they try to sift through information objects that seem mutually correlated and are possibly related to the query intent. This information sifting is the navigation phase. In our example, we would like somehow to express the idea of "cathood" as a query. So we click on the part of the feature cube where we saw some cats. Note that in the cube in Figure 3, a group of images is quite close to the intended query and blends into the larger group.

After users have played with images that seem related to the initial unspecified query, they actively specify the search criteria (make a selection) and execute another query. In addition to range queries and similarity queries in feature space, our system at San Diego permits an unconventional selection query using a cluster-selection operation. Roughly, the operation means, "I have selected a set of images to be somewhat close to what I want. Now determine a similarity criterion so these images are very close to each other, and make the search."
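The text does not spell out how the similarity criterion is derived, but one plausible reading is to weight each feature dimension by the inverse of its variance over the selected examples, so the dimensions on which the selection agrees count most. A minimal sketch under that assumption (ours, not the San Diego system's actual code):

```python
import math

def induced_weights(selected, eps=1e-6):
    """Weight each feature dimension by 1/variance over the selection."""
    dims, n = len(selected[0]), len(selected)
    weights = []
    for i in range(dims):
        mean = sum(f[i] for f in selected) / n
        var = sum((f[i] - mean) ** 2 for f in selected) / n
        weights.append(1.0 / (var + eps))
    return weights

def weighted_distance(p, q, w):
    return math.sqrt(sum(wi * (a - b) ** 2 for wi, a, b in zip(w, p, q)))

def cluster_select(database, selected, k):
    """Rank the database by distance to the selection's centroid under
    the induced similarity criterion."""
    w = induced_weights(selected)
    n, dims = len(selected), len(selected[0])
    centroid = [sum(f[i] for f in selected) / n for i in range(dims)]
    return sorted(database,
                  key=lambda f: weighted_distance(f, centroid, w))[:k]
```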


Figure 5. The user has already selected a set of cat images for a cluster selection, implicitly defining a similarity criterion. The distribution was recomputed, taking the user to a cat-intensive part of the database.

The result positions the user in an area of the feature space with a higher density of cat images (see Figure 5). The refinement phase follows, in which search parameters are modified to improve the quality of results. In most systems, adjusting relative weights of the primitives is the only way to perform query refinement. In our system, users can also explore any cluster, using it to determine similarity criteria. Although these operations lead to richer user interaction than in most systems, we are far from being able to associate semantics with images. The more informed and focused the user is, the less time is needed for exploration, and the sooner the selection is made. In any case, if the refined query does not produce the desired results, users may go back to the navigation or exploration phase.

Conclusions
A major difference between the query environment of a traditional database system and that of a visual information management system is that the latter needs to handle a plurality of possible interpretations of the data. This shift of focus from precise and well-formed queries with provably correct results to I-know-it-when-I-see-it correctness requires a query environment embedding visualization and visual manipulation of data. Such an environment facilitates exploration of the data, flexible feature and similarity selection, and incremental refinement of user queries. However, large databases containing semantically rich data sets, like images with multiple domain- and application-dependent semantics, require interaction environments far richer than those provided by the current generation of visual information systems. The ideas presented here are only a small step in a very rich research direction.

References


1. Cardenas, A., Ieong, I., Barker, R., Taira, R., and Breant, C. The knowledge-based object-oriented PICQUERY+ language system. IEEE Trans. Knowl. Data Eng. 5, 4 (Aug. 1993), 644–658.
2. Del Bimbo, A., Vicario, E., and Zingoni, D. Symbolic description and visual querying of image sequences with spatio-temporal logic. IEEE Trans. Knowl. Data Eng. 7, 4 (Aug. 1995), 609–622.
3. Eiter, T., and Mannila, H. Distance measures for point sets and their computation. Acta Inf. 34, 2 (Feb. 1997), 109–133.
4. Forsyth, D., Malik, J., Fleck, M., Greenspan, H., Leung, T., Belongie, S., Carson, C., and Bregler, C. Finding pictures in large collections of images. Tech. Rep. CSD-96-905, Univ. of California, Berkeley, 1996.
5. Gudivada, V., and Raghavan, V. An experimental evaluation of algorithms for retrieval by spatial similarity. ACM Trans. Inf. Syst. 13, 2 (Apr. 1995), 115–144.
6. Gupta, A. Visual information retrieval: A Virage perspective. Tech. Rep. TR95-01, Virage, Inc., San Mateo, Calif., 1995.
7. Liu, F., and Picard, R. Periodicity, directionality, and randomness: Wold features for image modeling and retrieval. IEEE Trans. Patt. Anal. Mach. Intell. 18, 7 (July 1996), 722–733.
8. Ma, W. NETRA: A toolbox for navigating large image databases. Ph.D. dissertation, Dept. of Electrical and Computer Engineering, Univ. of California at Santa Barbara, 1997.
9. Ravela, S., and Manmatha, R. Characterization of visual appearance applied to image retrieval. In Proceedings of the DARPA Image Understanding Workshop, T. Strat, Ed. (New Orleans, La., May 11–14, 1997). Morgan Kaufmann Publishers, San Francisco, 1997, pp. 693–699.
10. Rubner, Y., Guibas, L., and Tomasi, C. The earth mover's distance, multi-dimensional scaling, and color-based image retrieval. In Proceedings of the DARPA Image Understanding Workshop, T. Strat, Ed. (New Orleans, La., May 11–14, 1997). Morgan Kaufmann Publishers, San Francisco, 1997, pp. 661–668.
11. Stricker, M., and Dimai, A. Color indexing with weak spatial constraints. In Proceedings of SPIE Storage and Retrieval for Image and Video Databases IV, vol. 2670, I. Sethi and R. Jain, Eds. (San Jose, Calif., Jan. 28–Feb. 2, 1996). SPIE, Bellingham, Wash., 1996, pp. 29–40.

Amarnath Gupta ([email protected]) is a senior software scientist at Virage, Inc.

Simone Santini ([email protected]) is a Ph.D. candidate in the Computer Science Department at the University of California at San Diego.

Ramesh Jain ([email protected]) is a professor of electrical and computer engineering at the University of California at San Diego and is the chairman of the board and founder of Virage, Inc.