The SS+-tree: An Improved Index Structure for Similarity Searches in a High-Dimensional Feature Space

R. Kurniawati,* J. S. Jin, and J. A. Shepherd
School of Computer Science and Engineering
University of New South Wales
Sydney 2052, Australia

* Corresponding author. E-mail: [email protected]; Telephone: 61-2-9385-3979; Fax: 61-2-9385-5995.

ABSTRACT

In this paper we describe the SS+-tree, a tree structure for supporting similarity searches in a high-dimensional Euclidean space. Compared to the SS-tree, the tree uses a tighter bounding sphere for each node, an approximation to the smallest enclosing sphere, and it makes better use of the clustering present in the data by using a variant of the k-means clustering algorithm as the split heuristic for its nodes. A local reorganization rule is also introduced during tree building to reduce the overlap between the nodes' bounding spheres.

Keywords: high-dimensional indexing and retrieval, similarity search, multimedia databases, enclosing spheres, enclosing boxes (MBR)

1. INTRODUCTION

Representing images by a vector of features is a useful approach to handling the retrieval of images from image databases.10,17,19 Under such a scheme, similarity searches can be performed by using the Euclidean distance between two vectors as a measure of dissimilarity. However, feature vectors for images tend to have high dimensionality and hence are not well suited to indexing via conventional access structures. Similarity searches1,26,25 are generalized k-nearest-neighbour searches in which we can trade search performance against accuracy: we specify an allowed approximation error ε, and the search returns k neighbours each of which lies within a factor of (1 + ε) of the distance to the closest true nearest neighbour it may have missed.
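For concreteness, the guarantee can be written as a small predicate. This is our own minimal sketch, not code from the paper; the names are illustrative, and we use the common formalization against the true k-th neighbour distance:

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def acceptable(query, reported, true_kth_dist, eps):
    """An (1 + eps)-approximate k-NN answer: every reported neighbour lies
    within a factor (1 + eps) of the true k-th nearest-neighbour distance."""
    return all(euclidean(query, p) <= (1 + eps) * true_kth_dist for p in reported)
```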

2. ISSUES RELATED TO SIMILARITY SEARCHES IN A HIGH-DIMENSIONAL SPACE

A similarity search in a high-dimensional space is a difficult problem and, in general, as Minsky and Papert conjectured,16 we will end up having to examine a significant portion of the data. A number of indexing data structures have been suggested to handle high-dimensional data: the Similarity Search tree (SS-tree),27 the Telescopic Vector tree (TV-tree),9,13 and the X-tree.4 These new structures all perform better than the R*-tree, the best conventional multidimensional indexing structure available, for various reasons: the TV-tree reduces the dimension of the vectors by collapsing the first few dimensions that share the same values, the SS-tree exploits the clustering present in the data, and the X-tree tries to minimize the overlap between nodes' bounding boxes (hyperboxes). Preliminary analytical results for R-tree-like structures18,12 showed that the expected number of leaf nodes touched by a window query is proportional to the area covered by the bounding boxes, the weighted sum of the bounding-box perimeters, and the weighted sum of the number of leaf nodes. This analytical result is in agreement with the performance-enhancement techniques for R-tree-related structures: all the overlap-reduction techniques are indirect ways of reducing the total volume and surface area, and efforts to achieve high utilization tend to minimize the number of nodes. The surface-area-minimizing split heuristic used by the R*-tree is one of the reasons for that tree's good performance.

3. SS+-TREE PROPERTIES

Some of the most often discussed properties of an R-tree-based multidimensional access structure are its splitting heuristic, the shape of the bounding envelope of each node, the criteria used to choose the subtree into which new data is inserted, and the heuristics used to reduce overlap between nodes and to undo bad decisions made earlier. These properties are not independent of each other. The splitting heuristic and the subtree-selection criteria determine the shape of each node and hence the goodness of the bounding envelope; the goodness of the bounding envelope in turn determines the extent of the overlap between nodes; and the heuristics used to reduce overlap and early bad decisions also affect the shape of the nodes.

3.1. Node Split Heuristic

Almost every tree-based multidimensional access structure, e.g. the R-tree and its variants,8,3,21 the SS-tree,27 and the X-tree,4 uses a splitting plane that is perpendicular to one of the coordinate axes. Except for the SS-tree, the split criteria used are mainly topological ones, e.g. minimization of the perimeter, the area, the increase in the node's area, or the overlap. Because the tree structure aims to support similarity searches, we would like nearby vectors to be collected in the same or nearby nodes. This means that we prefer a division of the data that reflects the data's clustering, and also that the smaller the variance within the nodes, the better. Dividing data into groups while optimizing a statistic has been studied extensively in the clustering literature. For the SS+-tree, we chose the widely used k-means clustering algorithm.14 The computational complexity of this clustering algorithm is O(cni), where n is the number of data points, c is the number of clusters, and i is the maximum number of iterations performed by the algorithm. Applied as the splitting heuristic, the number of clusters c is two, the number of data points is bounded by the maximum node capacity plus one, and we can always set an upper bound on the number of iterations (a sketch is given below). With this splitting rule we have a greater degree of freedom, and the algorithm seeks to minimize the variance of the resulting partitions. Note also that in a Euclidean space this splitting heuristic tends to produce spherically shaped nodes. In keeping with this splitting heuristic, and like the SS-tree, we use the closest centroid as the criterion for choosing the subtree into which new data is inserted.

* Throughout this paper we will use the more familiar two- or three-dimensional terms: boxes for hyperboxes, circles or spheres for hyperspheres, lines or planes for (d-1)-dimensional hyperplanes, and perimeters or surface areas for the (d-1)-dimensional face volume.
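As a concrete sketch of the split (ours; the paper does not give code), a c-means partition along these lines, with the iteration count capped as described, might look like this:

```python
import random

def kmeans_split(points, c=2, max_iters=10):
    """Split a node's entries into c groups with k-means (Lloyd's algorithm),
    seeking to minimize the within-group variance.  O(c*n*i) overall.
    points: a list of equal-length coordinate tuples."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    centroids = random.sample(points, c)            # initial centroid guesses
    for _ in range(max_iters):                      # bounded iteration count i
        groups = [[] for _ in range(c)]
        for p in points:                            # assign to the nearest centroid
            groups[min(range(c), key=lambda j: sqdist(p, centroids[j]))].append(p)
        new_centroids = [tuple(sum(xs) / len(g) for xs in zip(*g)) if g else centroids[j]
                         for j, g in enumerate(groups)]
        if new_centroids == centroids:              # converged early
            break
        centroids = new_centroids
    return groups, centroids
```

For a node split, c = 2 and n is the maximum node capacity plus one, matching the O(cni) bound above.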

3.2. Overlap Reduction

Using the k-means splitting rule has its own consequences:† a node with a larger variance may overlap a smaller node (see Figure 1a for an illustration; here we use bounding spheres for the nodes, but a similar situation also occurs with bounding boxes). This situation is undesirable, since it becomes difficult to determine which node is the better one to descend during a search, and the total volume/area of the nodes is larger than it would be if no node "ate" another.

Figure 1: The node configuration (a) before and (b) after the application of the local reorganization rule.

To alleviate the situation where a node's boundary expands and heavily overlaps its siblings, the parent of recently updated nodes should check for this situation (the check takes O(n) time, where n is the number of children within the parent node). If the situation is found, we invalidate all the children involved (say k of them) and run a general k-means clustering over their grandchildren; the result of this rule can be seen in Figure 1b, and a sketch of the check follows below. The reorganization itself, when necessary, takes O(nk) time, and measures should be taken to make it happen infrequently, such as a guard that decays with the insertion of new data.‡ We also considered using R+-tree-like downward split propagation to remove the overlap between nodes completely, using the power planes2 between the spheres as the dividing planes. We finally discarded the idea, since the scope of the reorganization resulting from downward split propagation could involve the whole subtree, and the resulting fragmentation could be very large: the number of possible power planes of n spheres is n^⌈d/2⌉, where d is the dimensionality of the space.
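A sketch of the containment check (ours, assuming each child's bounding sphere is available as a (center, radius) pair; we give the straightforward quadratic test, whereas the paper reports an O(n) check over recently updated children):

```python
import math

def dist(a, b):
    """Euclidean distance between two coordinate tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def swallowed_pairs(spheres):
    """Find (i, j) pairs where child sphere i fully encloses sibling sphere j,
    i.e. the larger node has 'eaten' the smaller one (Figure 1a).
    spheres: list of (center, radius) pairs."""
    return [(i, j)
            for i, (ci, ri) in enumerate(spheres)
            for j, (cj, rj) in enumerate(spheres)
            if i != j and dist(ci, cj) + rj <= ri]

# If any pair is found, the k children involved are invalidated and their
# grandchildren are re-clustered with the k-means routine sketched in
# Section 3.1 (kmeans_split with c = k), which takes O(nk) time.
```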

3.3. Nodes' Bounding Envelope

For nearest-neighbour queries in Euclidean space, Cleary5 has proved that a decomposition into spherical regions is optimal in the sense that it minimizes the expected number of regions touched. Our intuition in two- and three-dimensional spaces leads us to expect that trees using bounding spheres for their nodes will perform worse than those using bounding boxes (see Figures 2a and 2b for an example§). The immediately apparent problem with spheres in Euclidean space is that we cannot cover the space with spheres without overlap; furthermore, due to the isotropic nature of spheres, extending a sphere in one direction extends it in every direction. With boxes, on the other hand, we can tile d-dimensional space without overlap, we can choose among 2d directions in which to extend a box (near the corners we have fewer choices), and most of the time we do not have to expand the volume as much as we would with a sphere.

† This effect will occur with any splitting heuristic that ignores overlap minimization as a criterion.
‡ The tree used for producing the experimental results (Section 4) does not have these guards.
§ The bounding envelope of a parent node does not have to enclose all the bounding envelopes of the parent's children. We use a lazy bounding-envelope enlargement scheme: we only enlarge the bounding envelope if it does not enclose the newly inserted data item or reinserted node.

Figure 2: A 4-level SS+-tree in a 2-dimensional space using various bounding envelopes: (a) bounding boxes; (b) bounding spheres; (c) bounding boxes and spheres.

Consider the formulas for the volume (1) and the surface area (2) of a d-dimensional sphere of radius r:

    V(r, d) = \frac{\pi^{d/2} r^d}{(d/2)!}            (1)

    S(r, d) = \frac{d \, \pi^{d/2} r^{d-1}}{(d/2)!}   (2)

Graphs of the volume and surface area of the unit sphere (r = 1) can be found in Figure 3. For r > 1 the sphere volume still grows exponentially with the dimension, but the rate of growth is tempered by the (d/2)! term, compared to square boxes with edges of length r (see Figure 4).
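Equations (1) and (2) are easy to evaluate numerically. The sketch below (ours) writes (d/2)! as Γ(d/2 + 1) so that odd dimensions are handled too, and prints the box-to-sphere volume ratio that underlies Figure 5:

```python
import math

def sphere_volume(r, d):
    """V(r, d) = pi^(d/2) r^d / (d/2)!  with (d/2)! = Gamma(d/2 + 1)."""
    return math.pi ** (d / 2) * r ** d / math.gamma(d / 2 + 1)

def sphere_surface(r, d):
    """S(r, d) = d pi^(d/2) r^(d-1) / (d/2)!."""
    return d * math.pi ** (d / 2) * r ** (d - 1) / math.gamma(d / 2 + 1)

# Wasted space when a box of side 2r bounds a sphere of radius r (Figure 5):
# the box-to-sphere volume ratio grows exponentially with the dimension d.
for d in (2, 5, 10, 20):
    box = 2.0 ** d                       # volume of the box, r = 1
    print(d, box / sphere_volume(1.0, d))
```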

Figure 3: The (a) volume and (b) surface area of the unit sphere (r = 1).

Since our splitting heuristic tends to produce spherical clusters, in a high-dimensional space spheres will be better suited as the bounding envelopes: the amount of space wasted by using bounding boxes instead (Figure 5a) increases exponentially with the dimension (see Figures 5b and 5c).

Figure 4: The volume of spheres with r = 1, 2, 3, 4 and the volume of square boxes with sides equal to r (the box volumes are drawn in thicker lines).

Figure 5: The wasted area resulting from using boxes instead of spheres to bound spherically shaped nodes: (a) the wasted area (a square box of side 2r versus the inscribed sphere of radius r); (b) the wasted area as a function of the dimension (divided by r^d); (c) graph (b) on a log scale.

Because the sphere volume grows exponentially with the dimension, in a high-dimensional space a small increase in the radius of a sphere gives a large increase in its surface area and volume. In accord with the analytical results for the R-tree,18,12 we want to minimize the total surface area and volume of these bounding envelopes. Computing the exact smallest enclosing sphere is not feasible in a high-dimensional space, since the time complexity is exponential in the number of dimensions (for approaches using linear programming15,6,7) or even super-exponential.24 We instead use an approximation to the smallest enclosing sphere, calculated by a spatial search utilizing the golden ratio γ20 [γ is the ratio by which a straight line can be divided such that the ratio of the shorter segment to the longer one equals the ratio of the longer segment to the whole line: γ = (√5 - 1)/2 = 0.6180339887…]. The use of the golden ratio in the search for the center of the smallest enclosing sphere was proposed11 as a way to preprocess data before feeding them to Welzl's smallest-enclosing-sphere algorithm.24 Among other access structures using bounding spheres, the SS-tree27 uses the centroid of the entries and the distance from the centroid to the farthest element as the center and radius of its bounding sphere (which most of the time is not optimal with respect to volume and surface area), while the sphere-tree23 uses the minimum bounding sphere computed by a variant of the method described by Elzinga and Hearn,7 whose cost at best depends exponentially on the dimension (the sphere-tree was only tested on 2-dimensional spatial data).

An illustration of the algorithm at work can be found in Figure 6. The algorithm starts with a guess (C1) for the center of the sphere, namely the midpoint of the bounding box of all the points; it then finds the farthest point (F1) from this center and moves the center towards this farthest point using the ratio γ (the new guess should be closer to the previous guess, so the ratio of the distance from the new center (C2) to the farthest point (F1) to the distance from the previous center (C1) to F1 should equal γ = 0.618…). In subsequent iterations, we can shorten the line segment from the current center (C2) to the farthest point (F2) using the halfspaces (A1) defined by the plane perpendicular to the previous center-farthest-point line (C1F1). These exclusion areas (A1…A3) are actually spherical in shape; we use planes to make the computation easier. Once the distance between the new center and the previous center is small enough (e.g. C3 and C4 are close enough), we obtain the sphere from the current center point (C4) and the distance from it to its farthest point. The generalization of this algorithm to calculate an approximation to the smallest enclosing sphere of a collection of spheres and boxes is straightforward once a suitable farthest point is defined for each.

Figure 6: An illustration of the calculation of the approximation to the smallest enclosing sphere.
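The following sketch is our reconstruction of the basic iteration from the description above, not the authors' code (the names GAMMA and approx_enclosing_sphere are ours); the half-space exclusion refinement (A1…A3 in Figure 6) and the generalization to collections of spheres and boxes are omitted for brevity:

```python
import math

GAMMA = (math.sqrt(5) - 1) / 2  # golden ratio, ~0.6180339887

def approx_enclosing_sphere(points, tol=1e-6, max_iters=100):
    """Approximate smallest enclosing sphere of a list of point tuples.
    Returns (center, radius)."""
    dim = len(points[0])
    lo = [min(p[i] for p in points) for i in range(dim)]
    hi = [max(p[i] for p in points) for i in range(dim)]
    center = [(a + b) / 2 for a, b in zip(lo, hi)]   # C1: bounding-box midpoint
    for _ in range(max_iters):
        far = max(points, key=lambda p: math.dist(center, p))  # farthest point F
        # Move the center toward F so that |C_new - F| = GAMMA * |C_old - F|.
        new_center = [f + GAMMA * (c - f) for c, f in zip(center, far)]
        moved = math.dist(center, new_center)
        center = new_center
        if moved < tol:                              # centers close enough: stop
            break
    radius = max(math.dist(center, p) for p in points)
    return tuple(center), radius
```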

4. EXPERIMENTAL RESULTS

In this section we give our empirical test results. First, we compare the spheres obtained from our approximation algorithm with the spheres obtained by using the centroid of the data as the center and the distance from this center to the farthest point as the radius (Table 1, Figure 7). Then we look at the similarity-search performance of the SS+-tree on uniform point data, comparing the effect of using different bounding envelopes for the nodes. Lastly, we compare the performance of the SS+-tree on the eigenface data.22

The comparison between the γ-search and the centroid-farthest-point method for calculating the bounding spheres of 10 and of 100 random points (100 trials each; Table 1 and Figure 7) suggests that the advantage of using a smaller bounding sphere is more evident in a high-dimensional space. Although the ratio between the radii of the spheres produced by the γ-search and the centroid-farthest-point method (r-ratio) is not large, the ratio between the volumes (V-ratio) can be quite significant in a high-dimensional space. Our uniformly distributed test data actually gave an advantage to the centroid-farthest-point method, since in a collection of uniformly distributed data points the centroid will be approximately in the middle. This also explains why the r-ratios (and correspondingly the V-ratios) for the enclosing spheres of 100 points are smaller than those for the enclosing spheres of 10 points.

         |     10 points within each sphere      |     100 points within each sphere
  Dim    |  Gamma   Cntrd   r-ratio   V-ratio    |  Gamma   Cntrd   r-ratio   V-ratio
   10    |  1.925   2.149    1.117      3.012    |  2.268   2.412    1.064      1.854
   20    |  2.621   2.873    1.096      6.294    |  2.993   3.179    1.062      3.341
   50    |  3.984   4.270    1.072     32.079    |  4.413   4.675    1.059     17.766
  100    |  5.557   5.881    1.058    288.730    |  6.195   6.384    1.031     20.246

Table 1: Comparison of the methods for calculating bounding spheres of random points using the γ-search (Gamma) and the centroid-farthest-point method (Cntrd): average radii, their ratio, and the corresponding volume ratio.
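Since volume scales with the d-th power of the radius, the V-ratio is simply the r-ratio raised to the power of the dimension, V-ratio = (r-ratio)^d. For example, at dimension 10 with 10 points per sphere, 1.117^10 ≈ 3.02, consistent with the tabulated V-ratio of 3.012 (the small discrepancy comes from rounding of the r-ratio).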


Figure 7: Graphs of the log of the ratios of the radii and volumes resulting from the γ-search and the centroid method for calculating the enclosing sphere: (a) 10 points each; (b) 100 points each.

For all the k-nearest-neighbour search experiments we chose k = 21 (to simulate a 20-nearest-neighbour query issued from a point not in the dataset), and the query points were chosen randomly from the dataset.¶ For the comparisons we use the same metrics used by White26 for assessing performance, i.e. the number of leaf nodes that have to be examined (ltouched), the number of internal nodes accessed (inode), the number of leaf nodes that actually contain a member of the nearest-neighbour set (lused), and the position of the last useful leaf node accessed during the search (llast). The last two numbers measure the tree's suitability for similarity searches (the smaller llast and lused, the better the approximate-query performance). We ran around 1000 random queries using points from the dataset chosen by Bernoulli trials; the values listed in the tables and the corresponding graphs are averages. The criteria used for the branch-and-bound nearest-neighbour search are the same for all SS+-trees, whatever the bounding envelope (we used the nearest-centroid criterion);‖ a sketch of the search is given below.

The comparison between the SS+-trees using bounding boxes and bounding spheres (Figure 8 and Table 2) supports our conjecture in Section 3.3. However, as we shall see later, this result actually depends on the shape of the cluster of points within each node, which in turn depends on the splitting heuristic used and the data distribution. The version of the SS+-tree using the γ-search approximation to the smallest enclosing sphere performed best in all trials. The version using bounding spheres calculated with the centroid-farthest-point method suffered from the extra space covered by its bounding spheres. The version using bounding boxes performed worse, which could be attributed to the amount of space wasted in covering spherically shaped nodes with boxes.

¶ This is the same setting used for testing the original SS-tree.26
‖ We tried several branching criteria for the nearest-neighbour search, including the minimum distance (the optimistic measure), the minimum of the maximum distance (the pessimistic measure), and several other heuristics. Our preliminary experiments suggested that the nearest-centroid criterion was the best in terms of minimizing the number of leaf nodes that had to be examined.
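A minimal sketch of such a branch-and-bound search (ours; the node fields is_leaf, points, children, centroid, center, and radius are our assumptions, where centroid is the mean of the entries and (center, radius) is the node's bounding sphere), which also tallies the ltouched and inode statistics reported in the tables:

```python
import heapq
import math

def knn_search(root, query, k):
    """Branch-and-bound k-NN.  Children are visited in order of the distance
    from the query to their centroid (the nearest-centroid criterion);
    subtrees whose bounding sphere cannot beat the current k-th best are pruned."""
    best = []                          # max-heap via negated distances, size <= k
    stats = {"ltouched": 0, "inode": 0}

    def kth_dist():
        return -best[0][0] if len(best) == k else float("inf")

    def visit(node):
        if node.is_leaf:
            stats["ltouched"] += 1
            for p in node.points:
                d = math.dist(query, p)
                if d < kth_dist():
                    heapq.heappush(best, (-d, p))
                    if len(best) > k:
                        heapq.heappop(best)
            return
        stats["inode"] += 1
        for child in sorted(node.children,
                            key=lambda c: math.dist(query, c.centroid)):
            # Prune: the closest any point in child can be to the query is
            # the distance to its bounding-sphere center minus its radius.
            if math.dist(query, child.center) - child.radius < kth_dist():
                visit(child)

    visit(root)
    return sorted((-d, p) for d, p in best), stats
```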

       |       SS+-tree (Gamma)        |       SS+-tree (Cntrd)        |        SS+-tree (Box)
  dim  | ltouch  lused  llast  inodes  | ltouch  lused  llast  inodes  | ltouch  lused  llast  inodes
   2   |   4.05   3.04   2.32    3.00  |   4.47   2.97   2.32    3.15  |   3.89   3.07   2.28    2.65
   4   |  16.55   6.48  10.57    8.07  |  20.36   6.56  12.78    9.58  |  17.51   6.27  10.48    6.76
   6   |  49.20   9.61  28.48   16.02  |  63.96   9.95  35.27   18.59  |  58.51  10.01  32.19   15.26
   8   | 135.86  13.70  64.26   17.00  | 166.20  13.23  64.32   19.00  | 175.41  13.73  63.75   20.00
  10   | 297.57  17.27  98.21   19.00  | 365.75  16.57  92.69   19.00  | 410.26  17.02  92.08   20.00

Table 2: A comparison of the SS+-tree using spheres calculated with the γ-search, the SS+-tree using spheres calculated from the centroid and farthest point, and the SS+-tree using bounding boxes, for uniformly distributed data (size 20000).


Figure 8: A comparison of the SS+-trees using bounding spheres and bounding boxes: (a) leaves touched; (b) leaves used; (c) the last used leaf.

We also experimented with a mixture of bounding boxes and bounding spheres (Figure 2c), but did not get good results. This can be explained by the fact that bounding spheres with boxes incurs a penalty (the wasted space) that increases with the dimension, and a similar penalty arises when we try to bound boxes with spheres.

       |        SS+-tree (Box)         |       SS+-tree (Sphere)       |            SS-tree
  dim  | ltouch  lused  llast  inodes  | ltouch  lused  llast  inodes  | ltouch  lused  llast  inodes
   10  |  65.36   8.39  24.77    8.47  |  64.42   8.71  24.87    6.96  |  85.84   9.793  28.37   7.97
   20  | 104.19   9.05  31.62    7.75  |  90.27   9.53  31.28    7.02  | 115.9   10.74   38.51   7.00
   50  | 157.26  10.51  37.83   11.17  | 134.11  10.50  38.15   10.19  | 158.1   13.65   55.73   9.98
  100  | 293.83  14.83  79.71   57.70  | 226.34  14.59  79.98   35.06  | 309.7   19.06  104.7   34.5

Table 3: A comparison of the SS+-tree using bounding boxes, the SS+-tree using bounding spheres, and the SS-tree for the EigenFace dataset (size 7562).

Table 3 and Figure 9 show that the SS+-tree consistently accessed fewer nodes than the SS-tree for the 21-nearest-neighbour searches in the eigenface data; the SS+-tree values referred to here are for the version with bounding rectangles. The SS+-tree with bounding spheres performed only slightly better than the SS-tree, although both versions of the SS+-tree have smaller llast and lused. This result is due to the nature of the eigenface data (Figure 10), which has a big scale difference between axes; covering points with this distribution using spheres results in covering a lot of empty space, due to the spheres' isotropic nature. Another observation is that the SS+-tree actually accessed a higher number of internal nodes than the SS-tree on the eigenface data. We attributed this result to the slightly higher fragmentation resulting from the frequent application of the unrestricted local reorganization rule, which ignored underflowed nodes.


Figure 9: A comparison of the SS+-tree using bounding boxes, the SS+-tree using bounding spheres, and the SS-tree for the EigenFace dataset (size 7562): (a) leaves touched; (b) leaves used; (c) the last used leaf.

Figure 10: The range of values of the EigenFace dataset along each dimension.

5. CONCLUSION

Bounding spheres seem to perform better when every axis has a similar range of values. For data with an unequal distribution along each axis, it might be better to use ellipsoidal envelopes together with a variance-minimizing split heuristic; compared to spheres, bounding boxes adapt better to data with a non-symmetrical distribution. The k-means splitting and reorganizing heuristics result in a tree with low variance, and low variance also means more compact nodes. The k-means reorganization rule helps reduce overlap. There are cases where overlap is unavoidable, in which case it is better not to split the node at all and to have one big node in place of several heavily overlapping ones (similar to the supernodes of the X-tree4).

ACKNOWLEDGEMENTS

We thank David White for providing the SS-tree source code and his test data, and David Madore for discussions about the properties of boxes and spheres in a high-dimensional Euclidean space.

REFERENCES

[1] S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Y. Wu. An optimal algorithm for approximate nearest neighbor searching. 5th Ann. ACM-SIAM Symposium on Discrete Algorithms, pages 573-582, 1995. Revised version.
[2] F. Aurenhammer. Voronoi diagrams: a survey of a fundamental geometric data structure. ACM Computing Surveys, 23(3):345-405, Sept. 1991.
[3] N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger. The R*-tree: an efficient and robust access method for points and rectangles. ACM SIGMOD, pages 322-331, May 1990.
[4] S. Berchtold, D. A. Keim, and H.-P. Kriegel. The X-tree: An index structure for high-dimensional data. In Proc. 22nd Int. Conf. on Very Large Data Bases, pages 28-39, Bombay, India, 1996.
[5] J. G. Cleary. Analysis of an algorithm for finding nearest neighbors in Euclidean space. ACM Transactions on Mathematical Software, 5(2):183-192, June 1979.
[6] D. Elzinga and D. Hearn. The minimum covering sphere problem. Management Science, 19(1):96-104, Sept. 1972.
[7] D. J. Elzinga and D. W. Hearn. Geometrical solutions for some minimax location problems. Transportation Science, 6(4):379-394, 1972.
[8] A. Guttman. R-trees: A dynamic index structure for spatial searching. In Proc. ACM SIGMOD Conf., pages 47-57, Boston, MA, June 1984. Also published as UCB Elec. Res. Lab. Res. R. No. M83-64, 1983, with M. Stonebraker; reprinted in M. Stonebraker, Readings in Database Systems, Morgan Kaufmann, San Mateo, CA, 1988.
[9] H. V. Jagadish. Indexing for retrieval by similarity. In V. Subrahmanian and S. Jajodia, editors, Multimedia Database Systems: Issues and Research Directions. Springer-Verlag, Berlin, 1996.
[10] J. Jin, L. S. Tiu, and S. W. S. Tam. Partial image retrieval in multimedia databases. In Proceedings of Image and Vision Computing New Zealand, pages 179-184, Christchurch, 1995. Industrial Research Ltd.
[11] J. S. Jin, B. W. Lowther, D. J. Robertson, and M. E. Jefferies. Shape representation and pattern matching under the multi-channel theory. In Proceedings of the 3rd Pacific Rim International Conference on Artificial Intelligence, pages 970-975, Beijing, 1994.
[12] I. Kamel and C. Faloutsos. On packing R-trees. Second Int. Conf. on Information and Knowledge Management (CIKM), pages 490-499, Nov. 1993.
[13] K.-I. Lin, H. V. Jagadish, and C. Faloutsos. The TV-tree: An index structure for high-dimensional data. VLDB Journal, 3(4):517-549, Oct. 1994.
[14] J. B. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the Berkeley Symposium on Mathematical Statistics and Probability, volume 1(5), pages 281-297, 1967.
[15] N. Megiddo. Linear-time algorithms for linear programming in R^3 and related problems. SIAM Journal on Computing, 12:759-776, 1983.
[16] M. Minsky and S. Papert. Perceptrons: An Introduction to Computational Geometry. MIT Press, Cambridge, 1969.
[17] W. Niblack, R. Barber, W. Equitz, M. Flickner, E. Glasman, D. Petkovic, P. Yanker, C. Faloutsos, and G. Taubin. The QBIC project: Querying images by content using color, texture and shape. SPIE 1993 Intl. Symposium on Electronic Imaging: Science and Technology, Storage and Retrieval for Image and Video Databases, 1908:173-187, Feb. 1993. Also available as IBM Research Report RJ 9203 (81511), Feb. 1, 1993, Computer Science.
[18] B.-U. Pagel, H.-W. Six, H. Toben, and P. Widmayer. Towards an analysis of range query performance in spatial data structures. In Proc. ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems, Washington, DC, May 1993.
[19] A. Pentland, R. W. Picard, and S. Sclaroff. Photobook: Content-based manipulation of image databases. In SPIE Storage and Retrieval of Image and Video Databases II, volume 2185, San Jose, Feb. 1995. SPIE. Also in International Journal of Computer Vision, Fall 1995.
[20] M. Schroeder. Fractals, Chaos, Power Laws: Minutes From an Infinite Paradise. W. H. Freeman and Company, New York, 1991.
[21] T. Sellis, N. Roussopoulos, and C. Faloutsos. The R+-tree: A dynamic index for multi-dimensional objects. In Proceedings of the 13th Conference on Very Large Databases, pages 507-518, Los Altos, CA, Sept. 1987. Morgan Kaufmann. Also available as SRC-TR-87-32, UMIACS-TR-87-3, CS-TR-1795.
[22] M. Turk and A. Pentland. Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1):71-86, 1991.
[23] P. J. M. van Oosterom. Reactive Data Structures for Geographic Information Systems. Oxford University Press, New York, 1993.
[24] E. Welzl. Smallest enclosing disks (balls and ellipsoids). In H. Maurer, editor, Proceedings of New Results and New Trends in Computer Science, pages 359-370. LNCS 555. Springer, June 1991.
[25] D. A. White and R. Jain. Algorithms and strategies for similarity retrieval. Technical Report VCL-96-01, Visual Computing Laboratory, University of California, San Diego, 9500 Gilman Drive, Mail Code 0407, La Jolla, CA 92093-0407, July 1996.
[26] D. A. White and R. Jain. Similarity indexing: Algorithms and performance. In Proceedings of the SPIE: Storage and Retrieval for Image and Video Databases IV, volume 2670, San Jose, CA, Feb. 1996.
[27] D. A. White and R. Jain. Similarity indexing with the SS-tree. In Proc. 12th IEEE International Conference on Data Engineering, New Orleans, Louisiana, Feb. 1996.