A Fast Indexing Method for Multidimensional Nearest Neighbor Search

J. Shepherd*a, X. Zhub and N. Megiddob

a University of New South Wales, Sydney, NSW, Australia
b IBM Almaden Research Center, San Jose, CA, USA

*Author email contacts: John Shepherd [email protected], Xiaoming Zhu [email protected]. The work reported here was performed while John Shepherd was on study leave at IBM Almaden Research Center.

ABSTRACT

This paper describes a snapshot of work in progress on the development of an efficient file-access method for similarity searching in high-dimensional vector spaces. This method has applications in, for example, image databases where images are accessed via high-dimensional feature vectors. The technique is based on using a collection of space-filling curves as an auxiliary indexing structure. Initial performance analyses suggest that the method works efficiently in moderately high-dimensional spaces (256 dimensions), with tolerable storage and execution-time overhead.

1. INTRODUCTION

"New generation" database applications, such as multimedia databases, data mining and information retrieval, often access the stored data via feature vectors and similarity measures. Data stored in such systems can be viewed as a collection of points in a vector space, and the similarity measure can be viewed as a measure of distance within that space. An important problem in such systems is that of finding the k most similar stored objects to a given sample object. This problem is typically solved by first mapping the sample object to its corresponding feature vector. Finding the solution set then becomes the problem of finding the k nearest neighbors of a vector in a large collection of vectors (the k-NN problem).

For applications where the vectors have dimensionality less than 10, existing multi-dimensional access methods (such as R-trees [5]) can be usefully employed to solve the k-NN problem. So far, however, there is no effective solution to the general k-NN search problem for applications where the vectors have dimensionality greater than ten. Around this dimensionality [7], the performance of existing multi-dimensional access methods degenerates to being worse than a linear scan of the vectors. Only in special cases, for example where knowledge of the data distribution can be exploited, is it possible to achieve efficient k-NN searching with very high-dimensional vectors. In this paper, we present a new technique that provides a promising direction towards solving the k-NN search problem for large collections of high-dimensional vectors.

Our interest in this problem derives from the development of effective and efficient content-based image retrieval systems. In content-based image retrieval, the properties of each image are described by a set of feature vectors automatically extracted from the image. Typical properties captured via feature vectors include color, shape and texture. The dimensionality d of these vectors is often high, of the order 10^2 or higher. The "similarity" between two images is determined by a distance function on feature vectors (often a Euclidean distance function). Solving the content-based image retrieval problem thus requires us to solve the k-NN problem for the set of feature vectors under the image distance function. Note that the cost of computing the distance function for very high-dimensional vectors can be considerable.

The testbed for our exploration of the k-NN problem is the QBIC [3] image retrieval system. We focus on the specific problem of retrieving images via a 256-dimensional color histogram, using QBIC's special-purpose color-similarity measure. We also assume that QBIC will be deployed on "somewhat dynamic" image databases (i.e. where the set of images changes by some small percentage each day). The latter condition means that update cost, as well as retrieval cost, is an issue in determining the usefulness of our system.

For a database of N images, we assume that each image is stored as a separate file. The index structure contains a collection of N color histograms, one for each image. Each histogram is stored in a tuple, where the other component of the tuple is the name of the image file which produced the histogram. Thus, at query time, the first step is to examine the histogram tuples, computing the similarity between stored histograms and the query histogram. Once we have identified a histogram whose distance is "close enough" to the query histogram, it is a simple matter to retrieve the corresponding image, since we have access to the name of the file which contains it.

The naive approach to solving the k-NN problem is to simply scan the database, computing the distance between each database image and the query image, and maintaining a list of the k images with the smallest distance values. If we assume that feature vectors are elements of R^d, then we can describe this method as:

Data Structures:

    database Db of N d-dimensional vectors (from R^d), indexed on image id j
    query vector q (from R^d)

Database Construction:

    insert vj into Db

Query Evaluation:

    for j in 1..N do
        fetch vj from Db
        d = distance(q, vj)
        include j in k-NN if d small enough
    end
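As a concrete illustration, the following is a minimal Python sketch of this linear-scan baseline. The function names are ours, and a plain Euclidean distance stands in for QBIC's special-purpose color-similarity measure:

    import heapq
    import math

    def distance(q, v):
        # Euclidean distance; QBIC's color-histogram measure would be
        # substituted here in practice.
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, v)))

    def linear_scan_knn(db, q, k):
        # db: list of (image_id, vector) pairs; returns the k nearest ids.
        heap = []  # max-heap (via negated distances) of the best k so far
        for image_id, v in db:
            d = distance(q, v)
            if len(heap) < k:
                heapq.heappush(heap, (-d, image_id))
            elif d < -heap[0][0]:
                heapq.heapreplace(heap, (-d, image_id))
        return [image_id for _, image_id in sorted(heap, reverse=True)]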

In this approach, insertion is extremely simple, and hence efficient. For deletion, once the records to be deleted are found (presumably using the image identifier), the effort to delete them is minimal. For queries, however, the method requires us to read all N vectors from the database and compute the distance function N times.

To achieve better performance than a linear scan of the database, efficient index methods for the feature vectors are needed. Traditional approaches to this problem using R-trees [5], k-d-trees [1] and their derivatives [4] fail to provide satisfactory results for high-dimensional vectors because the computational complexity is generally proportional to 2^d.

Our approach, based on recent work by Megiddo and Shaft [6], uses a collection of space-filling curves to act as an index structure. The scheme performs retrieval by mapping data and query images onto the curves and considering as candidates only images whose location is close to the curve-location of the query image on at least one curve. Since each curve is a one-dimensional projection of the vector space, the set of "close points" on the curve can be efficiently determined by using standard one-dimensional indexing schemes such as B-trees.

Other recent approaches towards solving the k-NN problem include VA-files [7] and the Pyramid technique [2]. VA-files uses a linear scan over a file of signatures that allow the rapid computation of lower bounds on the distance for each data point. The Pyramid technique partitions the d-dimensional space such that it can be accessed in "stages"; like our approach, it also makes use of standard one-dimensional indexing schemes.

In this paper, we provide an overview of our new method, CurveIx, discuss its performance characteristics and present some experimental results which indicate that it can provide efficient retrieval in the context of QBIC image searching.

2. METHOD

The basic idea behind our new method, which we will denote CurveIx (for "curve index"), is to order the d-dimensional space in many ways, with a set of (one-dimensional) space-filling curves, each constituting a mapping from R^d to R^1. This mapping provides a position along the curve for any d-dimensional feature vector. In essence, this gives a linear ordering of all points in the data set. A useful property of such a linear ordering is that, unlike projection, close points along the space-filling curve tend to correspond to close points in the d-dimensional feature space. Therefore, when a query vector is mapped to the space-filling curve, one can perform a range search for "nearby" points along the curve to find near neighbors in the feature space.

However, due to the nature of the R^d to R^1 mapping, some near neighbors of the query vector may be mapped far apart along a single curve. To make sure that these points are not overlooked, multiple space-filling curves are used, based on different mappings from R^d to R^1. The set of candidate nearest neighbors is formed from the union of small neighborhoods (ranges) from all of the curves.

The mapping function Ci takes a d-dimensional vector from R^d and forms a single real number by interleaving bits from the values along each dimension. For example, if the vector consists of

    < 0.b1,1 b1,2 ..., 0.b2,1 b2,2 ..., 0.b3,1 b3,2 ..., ..., 0.bd,1 bd,2 ... >

then it would produce the real value 0.b1,1 b2,1 ... bd,1 b1,2 b2,2 .... The same technique is used for each Ci, with the different mappings being achieved by transforming the vector before applying this technique (a code sketch of this interleaving appears after the pseudocode below).

Let us now consider in detail the methods used to construct and query the database with CurveIx:

Data Structures:

    database Db of N d-dimensional vectors (from R^d), indexed on image id j
    database Index of N*m curve points, indexed on curve# and position (i, p)
    m functions Ci mapping R^d onto [0, 1)
        - each vector is mapped onto a value on a Hilbert curve through the d-space
        - each Ci provides a different mapping by permuting and translating the original vector
    query vector q (from R^d)

Database Construction:

    for each vector vj, j in 1..N do
        for i in 1..m do
            p = Ci(vj)
            insert ((i, p), j) into Index
        end
        insert vj into Db
    end

Query Evaluation:

    for i in 1..m do
        p = Ci(q)
        lookup (i, p) in Index
        while "close to" p on curve i do
            collect next j on curve as candidate
        end
    end
    for each candidate j do
        fetch vj from Db
        d = distance(q, vj)
        include j in k-NN if d small enough
    end
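To make the mapping functions Ci concrete, here is a minimal Python sketch of the interleaving step, assuming vector components have been normalized into [0, 1). The names (make_transform, curve_key) are ours, and the plain z-order-style interleave shown here is a simplification: a true Hilbert mapping additionally rotates and reflects sub-cubes at each level, but the bounded-precision key idea is the same:

    import random

    def make_transform(d, seed):
        # One random permutation plus per-dimension translation, standing in
        # for the random rotations and translations described below.
        rng = random.Random(seed)
        perm = list(range(d))
        rng.shuffle(perm)
        offsets = [rng.random() for _ in range(d)]
        return perm, offsets

    def curve_key(v, transform, bits_per_dim=8):
        # Map a vector with components in [0, 1) to a bounded-precision
        # integer key by interleaving bits across dimensions.
        perm, offsets = transform
        w = [(v[p] + o) % 1.0 for p, o in zip(perm, offsets)]
        q = [int(x * (1 << bits_per_dim)) for x in w]  # quantize each component
        key = 0
        for b in range(bits_per_dim - 1, -1, -1):      # most significant bit first
            for x in q:
                key = (key << 1) | ((x >> b) & 1)
        return key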

The different mapping functions Ci are produced by rotations and translations of the points within the space. In the context of our feature vectors, the rotations are implemented by forming permutations of the elements of each vector, and the translation is accomplished by scaling some elements in the vector. At present, the rotations and translations are essentially random. It is an open problem whether a set of "optimal" translations and rotations could be devised for arbitrary data.

Insertion requires us not only to insert each histogram into the database Db, but also to insert m new curve points into the Index for each histogram. Similarly, removing a vector from the data set requires us to delete one point from each of the m curves. For N histogram records, we are required to store mN curve points, plus their associated access method (in our case, a B-tree). These overheads are paid back at query time, where we read a considerably smaller proportion of the database than would be required with a linear scan.

The query-time cost involves m B-tree lookups (one for each curve), plus examination of the points on each curve near the query point. Ideally, all of these points would lie on a single page of the Index database, and the curve-access overhead would be around m pages. In practice, however, we are likely to have to read one or two of the neighboring pages as well, and so 2m page reads is a more likely value for this component of the query cost. The gain comes when we establish a candidate set that is considerably smaller than the total number of histograms.

                               Accuracy                      Efficiency
    Method                     pmin   pmax   pavg   E        Dmin    Dmax    Davg
    Linear scan                100%   100%   100%   25/25    10,000  10,000  10,000
    CurveIx (m = 40, r = 7)    10%    100%   84%    11/25    161     1,147   616
    CurveIx (m = 40, r = 11)   30%    100%   88%    14/25    331     1,334   918
    CurveIx (m = 40, r = 15)   30%    100%   90%    15/25    518     1,544   1,227

    Table 1. Experimental results: 25 queries on a 10,000-image database.

There is, however, another important aspect of the system: accuracy. As noted above, it is possible that a near neighbor of the query may be mapped far away from the query on the curve. We have suggested using multiple curves, based on translations and permutations of the original vector, in order to increase the likelihood of finding such a near neighbor, but even this provides no guarantee that every near neighbor will be mapped close to the query on at least one curve. Thus it is possible that some near neighbors may be omitted during a search under the CurveIx method. We can improve the chances of finding all nearest neighbors by using still more curves or by scanning a larger neighborhood on each curve, but both of these approaches are likely to increase the size of the candidate set, thus diminishing the query-cost benefits of the method. There is thus a trade-off between efficiency and accuracy in the CurveIx scheme.

This discussion raises the issue of how many curves are required in practice, and how much of each curve we need to examine, in order to achieve a sufficient level of accuracy. Every new curve makes the method more accurate, but increases the query-time overhead, because we need to perform an indexed lookup on each curve to determine the near neighbors. To a lesser extent, increasing the range of points on each curve also increases the query-time overhead. It may not require extra page reads (if the extra vectors lie in the already-fetched page), but is almost certain to lead to more distance calculations being required.

How well does this method work in practice? As we shall see in the next section, experimental analyses on a database of 10,000 widely-varied images have shown that it is possible to find on average nine of the ten nearest neighbors (and always find the two closest neighbors) with only 40 curves and with a search neighborhood of less than 30 curve points.
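Putting the pieces together, the following is a minimal in-memory Python sketch of the CurveIx scheme as described above, reusing make_transform, curve_key and distance from the earlier sketches. A sorted list searched with bisect stands in for each B-tree; the class and method names are ours, not those of the actual implementation:

    import bisect

    class CurveIndex:
        def __init__(self, m, d, bits_per_dim=8):
            self.transforms = [make_transform(d, seed=i) for i in range(m)]
            self.bits = bits_per_dim
            # one sorted list of (curve_key, image_id) per curve, standing in
            # for a B-tree indexed on (curve#, position)
            self.curves = [[] for _ in range(m)]
            self.db = {}                       # image id -> vector

        def insert(self, image_id, v):
            self.db[image_id] = v
            for t, curve in zip(self.transforms, self.curves):
                bisect.insort(curve, (curve_key(v, t, self.bits), image_id))

        def query(self, q, k, r):
            candidates = set()
            for t, curve in zip(self.transforms, self.curves):
                pos = bisect.bisect_left(curve, (curve_key(q, t, self.bits),))
                # scan r points below and r points above the query position
                for _, image_id in curve[max(0, pos - r):pos + r + 1]:
                    candidates.add(image_id)
            ranked = sorted(candidates, key=lambda j: distance(q, self.db[j]))
            return ranked[:k]

With m = 40 and r = 7 this examines at most m(2r + 1) = 600 curve entries before duplicates are removed, in line with the candidate-set bound discussed in the next section.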

3. EXPERIMENTAL RESULTS

In this section, we report the results of some preliminary experiments to determine the effectiveness of this approach on a set of image data collected for use with QBIC. The data consisted of 10,000 images, primarily photographs, from a number of different collections on the Corel PhotoCD, from which the QBIC 256-color histograms had been extracted. The queries comprised 20 images chosen at random from the database itself, plus five images from unrelated collections. The database was indexed using 40 curves whose rotations and translations were determined at random. To determine "closeness" on each curve, the neighborhood was chosen to be a simple range of r points above and r points below the query point (for r = 7, 11, 15). In summary, the experiment involved 25 k-NN queries on a database with N = 10,000, d = 256, m = 40, k = 10.

To measure the accuracy of the indexing method, we collected statistics on what proportion p of the top-10 answers (according to the distance function) were retrieved for each query, and counted how often the index method gave the exact answer, E. To measure the efficiency of the method, we collected statistics on how many times, D, the distance function was computed (i.e. the size of the candidate set). Table 1 summarizes this data. For comparison, we also measured the efficiency of a linear scan of the database to determine the top-10 matching images (naturally, the accuracy of such a method is 100%).

The major result here is that with a moderate number of curves (40) and with an effective scanning range on each curve of less than 20, nearly all (90%) of the most similar images can be found.
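For concreteness, the per-query accuracy p and the exact-answer test behind E can be computed as follows (a trivial sketch with our own names; true_top_k would come from the exhaustive linear scan, retrieved from CurveIx):

    def accuracy(true_top_k, retrieved):
        # proportion of the true top-k answers found by the index method
        return len(set(true_top_k) & set(retrieved)) / len(true_top_k)

    def exact(true_top_k, retrieved):
        # does the index method return precisely the true top-k set?
        return set(retrieved) == set(true_top_k)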

The results for Dmin, Dmax and Davg require some explanation. For m curves with a scanning range of r, the method should examine 2r + 1 points on each of the m curves to produce the set of candidates. Since some candidates may appear within the scanning neighborhood on more than one curve, we can ignore the duplicates, and m(2r + 1) is therefore an upper bound on the size of the candidate set. So, for example, with a range r = 7, we would examine 15 points on each of m = 40 curves, and thus make somewhere between 15 and 600 distance calculations.

Under our current implementation, we use only a partial order of the points on the curve (i.e. multiple vectors can map to the same curve point). This is necessary because we need to use the curve position as a numeric key for performing comparisons in the dbm lookup function. The default behaviour of our Hilbert-number generators is to simply increase the accuracy (i.e. use more bits from each dimension) until two Hilbert-numbers can be distinguished. However, we need to store these numbers, and so we cannot allow them to grow without bound. Placing a bound on the "accuracy" of the Hilbert-numbers means that some vectors that are actually distinct produce the same number. In terms of query-time performance, this means that we sometimes need to scan slightly beyond the scanning range to consider all points with a given approximate Hilbert value. What this means, in effect, is that we have increased the scanning range by a small amount from the stated r value.

As far as input/output costs are concerned, under our implementation a database size of 10,000 represents a break-even point for the CurveIx method compared to linear scan. If the database is smaller, linear scan is cheaper; as the database size increases, however, CurveIx becomes a more viable option. We used the dbm library to implement B-tree indexes for the curve points as well as the histogram data, linked via the image identifier. Each index used an 8KB page size, meaning that we could store approximately 10 histograms and 20 curve points per page. We need to examine 1-3 pages of information for each curve, giving around a 100-page indexing overhead for 40 curves to obtain a candidate set of size around 1000. If we assume no clustering on the image data, then we will need to read on average almost all of the data pages in order to read all 1000 candidate histograms. For larger databases, the size of the candidate set should not increase, and so we need to read a progressively smaller proportion of the histogram data.

By way of comparison with our approach, we briefly describe the costs associated with the VA-files technique [7]. Under VA-files, each vector has an associated signature, computed by interleaving bits from each of the individual vector elements, in a similar manner to the method we use to form Hilbert-numbers. If we allocate only 3 bits for each dimension, we obtain a 768-bit (96-byte) signature. To answer a query under VA-files, we need to scan all of the signatures to determine a set of candidate vectors. For a 10,000-image database, this means an initial overhead of ⌈10000/(8192/96)⌉ = 118 page reads. This is similar to the input/output overhead for CurveIx. However, for each signature read, an approximate distance calculation needs to be carried out in order to determine the final set of candidate records. This is a considerable additional cost compared to CurveIx, and an experiment with a 1,000-image database indicated that the VA-files approach did not produce candidate sets that were significantly smaller than those produced by CurveIx. On the other hand, VA-files candidate sets are guaranteed to contain the k nearest neighbors.
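The page-count arithmetic in this comparison can be reproduced directly (a throwaway sketch; the page size, signature size and per-curve page estimates are the figures stated above):

    import math

    PAGE = 8192                        # 8KB index pages
    SIGNATURE = 3 * 256 // 8           # VA-files: 3 bits/dimension -> 96 bytes

    va_pages = math.ceil(10000 / (PAGE / SIGNATURE))  # signature scan: 118 pages
    curveix_pages = 40 * 2.5           # 1-3 pages per curve, 40 curves: ~100 pages
    print(va_pages, curveix_pages)     # 118 100.0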

4. CONCLUSION

While it is premature to claim that we have a complete general solution to the high-dimensional nearest-neighbor problem, this approach does appear to be promising, at least for our image similarity application. Areas that require further investigation include:

- how many curves are needed to maintain accuracy for different databases
- how to control the scan over the curve neighborhood
- alternative schemes for mapping histogram data onto curves (to overcome quantization problems)

For the first of these points, we have conducted experiments on a range of databases with a range of curve numbers. It appears that the number of curves needs to increase with increasing d in order to maintain a fixed accuracy level, but it is not clear whether the rate of increase is greater than linear with respect to d.

For the second point, using a fixed-size neighborhood on each curve was a simple-to-implement initial solution. From observations of query-time performance, it now appears that in many cases the neighborhood scan on each curve could be terminated well before reaching the static neighborhood size. Heuristics such as terminating the scan as soon as a point is found that is "too far" from the query point may enable our method to maintain accuracy levels with less query-time overhead.

This paper has presented a new approach to indexing high-dimensional data to assist with k-NN searching over that data. For large QBIC image databases, the CurveIx method appears to provide an effective solution to the problem of efficient retrieval of the k images most similar to a given query image.

REFERENCES

1. Jon Louis Bentley. Multidimensional binary search trees in database applications. IEEE Transactions on Software Engineering, 5(4):333-340, 1979.
2. Stefan Berchtold, Christian Böhm, and Hans-Peter Kriegel. The Pyramid-Technique: Towards breaking the curse of dimensionality. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD 1998), pages 142-153, 1998.
3. Myron Flickner, Harpreet S. Sawhney, Jonathan Ashley, Qian Huang, Byron Dom, Monika Gorkani, Jim Hafner, Denis Lee, Dragutin Petkovic, David Steele, and Peter Yanker. Query by image and video content: The QBIC system. IEEE Computer, 28(9):23-32, 1995.
4. Volker Gaede and Oliver Günther. Survey on multidimensional access methods. Technical Report ISS-16, Humboldt University, 1995.
5. Antonin Guttman. R-trees: A dynamic index structure for spatial searching. In SIGMOD'84, Proceedings of Annual Meeting, pages 47-57, June 1984.
6. Nimrod Megiddo and Uri Shaft. Efficient nearest neighbour indexing based on a collection of space-filling curves. Technical report, IBM Almaden Research Center, 1997.
7. Roger Weber, Hans-Jörg Schek, and Stephen Blott. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In Proceedings of the Very Large Data Bases (VLDB) Conference, August 1998.