DOLPHIN: An Efficient Algorithm for Mining Distance-Based Outliers in Very Large Datasets
FABRIZIO ANGIULLI and FABIO FASSETTI
DEIS, Università della Calabria

In this work a novel distance-based outlier detection algorithm, named DOLPHIN, working on disk-resident datasets and whose I/O cost corresponds to the cost of sequentially reading the input dataset file twice, is presented. It is both theoretically and empirically shown that the main memory usage of DOLPHIN amounts to a small fraction of the dataset and that DOLPHIN has linear time performance with respect to the dataset size. DOLPHIN gains efficiency by naturally merging together in a unified schema three strategies, namely the selection policy of objects to be maintained in main memory, the usage of pruning rules, and similarity search techniques. Importantly, similarity search is accomplished by the algorithm without the need of preliminarily indexing the whole dataset, as other methods do. The algorithm is simple to implement and it can be used with any type of data, belonging to either metric or nonmetric spaces. Moreover, a modification to the basic method allows DOLPHIN to deal with the scenario in which the available buffer of main memory is smaller than its standard requirements. DOLPHIN has been compared with state-of-the-art distance-based outlier detection algorithms, showing that it is much more efficient.

Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval
General Terms: Algorithms, Performance
Additional Key Words and Phrases: Data mining, outlier detection, distance-based outliers
ACM Reference Format: Angiulli, F. and Fassetti, F. 2009. DOLPHIN: An efficient algorithm for mining distance-based outliers in very large datasets. ACM Trans. Knowl. Discov. Data 3, 1, Article 4 (March 2009), 57 pages. DOI = 10.1145/1497577.1497581 http://doi.acm.org/10.1145/1497577.1497581

A preliminary version of this article appears in the Proceedings of the 2007 International Conference on Information and Knowledge Management (CIKM'07) [Angiulli and Fassetti 2007]. Authors' addresses: F. Angiulli and F. Fassetti, DEIS, Università della Calabria, Via P. Bucci, 41C, 87036 Rende (CS), Italy; email: {f.angiulli, f.fassetti}@deis.unical.it.
© 2009 ACM 1556-4681/2009/03-ART4 $5.00 DOI 10.1145/1497577.1497581 http://doi.acm.org/10.1145/1497577.1497581
ACM Transactions on Knowledge Discovery from Data, Vol. 3, No. 1, Article 4, Publication date: March 2009.


1. INTRODUCTION

An outlier is an observation that differs so much from other observations as to arouse suspicion that it was generated by a different mechanism [Hawkins 1980]. There exist several approaches to the identification of outliers, namely, statistical-based [Barnett and Lewis 1994], deviation-based [Arning et al. 1996], distance-based [Knorr and Ng 1998], density-based [Breunig et al. 2000; Jin et al. 2001], projection-based [Aggarwal and Yu 2001], MDEF-based [Papadimitriou et al. 2003], and others. Outliers were first studied in the field of statistics: Barnett and Lewis [1994] provided about one hundred discordance tests for many standard data distributions. Appropriate discordance tests have been defined based on the distribution, on the knowledge of its parameters, on the number of expected outliers, and on their type. These tests are very specialized, since each of them is designed only for a specific distribution. In practice, however, the distribution of the data can be unknown, no discordance test may be available for the data distribution, or the data may not fit any standard distribution. Furthermore, discordance tests are suitable only for univariate data. To overcome the limitations of statistical outlier definitions, Knorr and Ng [1998] introduced the notion of distance-based outlier, reported next.

Definition 1.1. Let DS be a set of objects, also said dataset, k be a positive integer, and R be a positive real number. An object obj of DS is a DB(k, R)-outlier, or a distance-based outlier with respect to parameters k and R, or, simply, an outlier, if less than k objects in DS lie within distance R from obj. Objects lying at distance not greater than R from obj are called neighbors of obj. Each object obj is considered a neighbor of itself.

This definition assumes that a distance function relating each pair of dataset objects is available.
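Definition 1.1 can be checked directly, at quadratic cost, by counting neighbors. The following minimal Python sketch makes the definition concrete; the function name, the toy points, and the use of Euclidean distance are our own illustrative choices, not part of the paper:

```python
from math import dist  # Euclidean distance; any distance function would do

def db_outliers(dataset, k, R):
    """Return all DB(k, R)-outliers of `dataset` by brute force.

    An object is an outlier if fewer than k objects (itself included)
    lie within distance R of it. O(N^2) distance computations.
    """
    outliers = []
    for obj in dataset:
        neighbors = sum(1 for other in dataset if dist(obj, other) <= R)
        if neighbors < k:  # each object counts as its own neighbor
            outliers.append(obj)
    return outliers

points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
print(db_outliers(points, k=2, R=1.0))  # the isolated point (5.0, 5.0)
```

With k = 2 and R = 1, only the isolated point has fewer than two neighbors within the radius, so it is the single DB(2, 1)-outlier.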
Knorr and Ng [1998] showed that the distance-based outlier definition generalizes the notion of outlier provided by several discordance tests developed in statistics. Furthermore, this definition is suitable for multivariate data and can be applied even if the distribution of the data is unknown. Some variants of Definition 1.1 have subsequently been introduced in the literature [Ramaswamy et al. 2000; Angiulli and Pizzuti 2002]. Moreover, distance-based definitions have proved a useful tool for data analysis in different contexts [Knorr and Ng 1999; Eskin et al. 2002; Lazarevic et al. 2003]. This work deals with the problem of efficiently detecting distance-based outliers in huge collections of data. The outlier detection problem is defined as follows: Given a dataset DS, a positive integer k, and a positive real number R, find all the DB(k, R)-outliers in DS. From a theoretical point of view, the outlier detection problem is easy, since it can be solved in quadratic time with respect to the dataset size by computing all the pairwise distances among the dataset objects. However, since


in applications outliers are usually searched for in large collections of data, from a practical point of view the quadratic algorithm may be unsatisfactory or impractical. Distance-based outlier scores have the interesting property of being a monotonic nonincreasing function of the portion of the dataset already explored. Based on this property, in recent years several clever algorithms have been proposed to quickly detect distance-based outliers [Knorr et al. 2000; Ramaswamy et al. 2000; Bay and Schwabacher 2003; Angiulli and Pizzuti 2005; Ghoting et al. 2006; Tao et al. 2006]. Some of them are efficient in terms of CPU cost, while others aim at minimizing the I/O cost. However, it is worth noticing that none of them is able to simultaneously achieve these two goals on multidimensional disk-resident datasets. The contribution of this work can be summarized as follows.
—A novel distance-based outlier detection algorithm, named DOLPHIN (Detecting OutLiers PusHing objects into an INdex), working on disk-resident datasets and whose I/O cost corresponds to the cost of sequentially reading the input dataset file twice, is presented.
—Both theoretical justification and experimental evidence that, for well-founded combinations of the parameters R and k, the main memory usage of DOLPHIN amounts to a small fraction of the dataset, are provided.
—It is shown both analytically and experimentally that DOLPHIN has linear time performance with respect to the dataset size.
—DOLPHIN gains efficiency by naturally merging together in a unified schema three strategies: (1) the selection policy of objects to be maintained in main memory, (2) usage of pruning rules, and (3) similarity search techniques.
Importantly, similarity search is accomplished by the algorithm without the need of preliminarily indexing the whole dataset, as other methods do, and the aforementioned worst-case spatial and temporal performances are independent of the effectiveness of strategies (2) and (3).
—A method based on sampling theory to meaningfully set the parameter R of the distance-based outlier definition without trial and error is presented.
—DOLPHIN has been compared with state-of-the-art distance-based outlier detection algorithms, showing that it is much more efficient.
—DOLPHIN is simple to implement and it can be used with any type of data, belonging to either metric or nonmetric spaces.
The rest of the work is organized as follows. Section 2 describes the DOLPHIN algorithm. This algorithm performs two dataset scans. While scanning the dataset it maintains in main memory a data structure, called INDEX, storing some dataset objects. The INDEX structure represents a summary of the dataset and it is used in order to recognize inliers early. Section 3 studies the spatial cost of DOLPHIN. In particular, an upper bound s_up to the size of the INDEX data structure is derived. It is shown that, for


large datasets, the size of INDEX can be upper bounded by the ratio k/p, where p denotes the probability that a randomly picked object of the dataset and a randomly picked object of INDEX are neighbors. In Section 4, based on the notion of unification between outlier definitions introduced in Knorr and Ng [1998] and on the concept of outlier region for a statistical distribution [Davies and Gather 1989], a methodology to compute the value p_f of the probability p, once the distribution f of the dataset is known, is derived, and then such a value is computed for some standard distributions (Sections 4.1 and 4.2). As a major result, it is shown that for well-founded values k_f and R_f of the parameters k and R, that is, the values for which the statistical definition associated with the distribution f and the distance-based definition unify, the upper bound s_up corresponds to a small fraction of the overall dataset. The aforesaid theoretical analysis is then empirically validated (Section 4.3). Furthermore, it is shown that the number of objects accommodated in main memory by DOLPHIN is far smaller than the upper bound s_up. Section 5 accounts for the temporal cost analysis of the algorithm. It is shown that the time complexity of DOLPHIN is O((k/p) · N · d), where N denotes the dataset size and d the dataset dimensionality; hence the algorithm has linear-time performance in the dataset size N. In Section 6 the basic schema of DOLPHIN is modified in order to deal with the scenario in which the available memory is smaller than the space required by the INDEX data structure. Section 7 surveys distance-based outlier definitions and related methods and compares the approach presented here with other approaches, pointing out major differences and contributions. Section 8 describes experimental results. The theoretical analysis conducted in the previous sections is validated.
Also, the sensitivity of the INDEX size to the parameters R and k, the memory requirements, the execution time, and the effectiveness of the pruning rules are analyzed. Moreover, a method to set the parameter R is introduced. DOLPHIN is compared with state-of-the-art algorithms, showing that it is much more efficient. Finally, the sensitivity of the fixed-memory version of DOLPHIN to the buffer size is studied. Section 9 reports conclusions.

2. ALGORITHM

In this section the algorithm DOLPHIN is described. The algorithm receives as input a disk-resident dataset DS and parameters k and R, and outputs all and only the DB(k, R)-outliers of DS. DOLPHIN makes use of a data structure called INDEX, which is a DBO-index (defined next), where DBO is the acronym for Distance-Based Outlier. First of all, the definition of DBO-node, which is the building block of the DBO-index, is provided.

Definition 2.1. A DBO-node n, or simply node, is a data structure containing the following information:


—n.obj: an object of DS;
—n.id: the record identifier of n.obj in DS;
—n.nn: an array consisting of h integers, where h is a user-defined parameter; the ith integer n.nn[i] (1 ≤ i ≤ h) is a lower bound to the number of objects of DS whose distance from n.obj is in the interval ((R/h) · (i − 1), (R/h) · i] (the first interval also includes the value 0). Note that n.nn does not take into account the object n.obj.

A DBO-index is a data structure based on DBO-nodes, as defined in the following.

Definition 2.2. A DBO-index INDEX is a data structure storing DBO-nodes and providing a range query search method. A range query search method receives as input an object obj (also called the center of the search) and a real number R ≥ 0 (also called the radius of the search) and returns a superset of the nodes in the DBO-index associated with objects whose distance from obj is not greater than R. The size of a DBO-index is the number of DBO-nodes it stores.

Figure 1 shows the algorithm DOLPHIN. It performs two sequential scans of the dataset file. During the first scan, the DBO-index INDEX is employed to maintain a summary of the portion of the dataset already examined. In particular, for each incoming dataset object obj, the nodes already stored in INDEX are exploited in order to determine if obj is an inlier. The object obj is inserted into INDEX if, according to the policy described in the following, it is not recognized as an inlier. By adopting the strategy depicted in Figure 1 it is guaranteed that INDEX contains all the outliers occurring in the portion of the dataset already scanned. However, not all the objects stored in INDEX are outliers. After having picked the next object obj from the dataset (line 2), the function isInlier is called in order to check whether obj can be recognized as an inlier (line 4). If obj is not recognized as an inlier, then it is inserted into INDEX (line 5).
During the execution of the first scan, some of the objects stored in INDEX can be recognized as inliers on the basis of the objects read from the dataset after them (this task is accomplished by the function isInlier). These objects are called in the following proved inliers. When the first dataset scan finishes, INDEX contains a superset of the dataset outliers. The objects stored in INDEX which are not proved inliers are called candidate outliers. The outliers are a subset of the candidate outliers, and in order to isolate them a second scan is needed. Before starting the second dataset scan the proved inliers are removed from INDEX, since they are no longer useful (line 6). Furthermore, all the elements of the n.nn arrays associated with the nodes n kept in INDEX are set to zero. During the second scan, for each dataset object, the procedure pruneInliers is called in order to remove inliers from INDEX (lines 7–8). At the end of the second dataset scan, INDEX contains all and only the outliers of DS. Next, the function isInlier and the procedure pruneInliers are detailed.


Fig. 1. The DOLPHIN distance-based outlier detection algorithm.
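The figure itself is not reproduced here. As a stand-in, the two-scan control flow it depicts can be sketched in Python as follows. This is a deliberately simplified rendition, not the paper's algorithm: the n.nn histogram is collapsed to a single neighbor counter, the DBO-index is a plain list scanned linearly rather than a range-search structure, and the pruning rules PR1–PR3 are omitted; all names are ours.

```python
from math import dist

class Node:
    """Simplified DBO-node: object, record id, and one neighbor counter."""
    def __init__(self, obj, rid):
        self.obj, self.id, self.count = obj, rid, 0

def dolphin_sketch(dataset, k, R):
    # --- first scan: build a summary containing a superset of the outliers ---
    index = []
    for rid, obj in enumerate(dataset):
        neighbors = 0
        for node in index:
            if dist(obj, node.obj) <= R:
                neighbors += 1
                node.count += 1            # obj is a neighbor of node.obj too
        if neighbors < k - 1:              # not provably an inlier: keep it
            index.append(Node(obj, rid))
    # drop proved inliers and reset counters before the second scan
    index = [n for n in index if n.count < k - 1]
    for n in index:
        n.count = 0
    # --- second scan: count true neighbors of each surviving candidate ---
    for obj in dataset:
        for node in index:
            if dist(obj, node.obj) <= R:
                node.count += 1            # includes node.obj itself
    return [n.obj for n in index if n.count < k]

points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
print(dolphin_sketch(points, k=2, R=1.0))
```

Even in this stripped-down form the key invariants hold: an object skipped or removed during the first scan has at least k − 1 index neighbors plus itself, so it is a genuine inlier, while every true outlier survives into the second scan, where exact neighbor counts settle its status.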

Function isInlier. The function isInlier takes as input the DBO-node n_curr associated with the current dataset object n_curr.obj. First of all, a range query search with center n_curr.obj and radius R is performed in INDEX (line 10). For each DBO-node n_index returned by the search, the distance dst between n_curr.obj and n_index.obj is computed. For a generic object n.obj, the radius of a hypersphere centered in n.obj and containing at least k − 1 dataset objects other than n.obj can be obtained from the array n.nn as follows. The value of the radius associated with the object n.obj, denoted in the following by n.rad, is computed as (R/h) · i, where i (1 ≤ i ≤ h) is the smallest integer such that the sum Σ_{j≤i} n.nn[j] is at least k − 1. If Σ_{j≤h} n.nn[j]


is smaller than k − 1, then n.rad is +∞. Note that n.rad represents an upper bound to the actual radius of the hypersphere centered in n.obj and containing at least k − 1 dataset objects other than n.obj and, hence, at least k dataset objects. Clearly enough, the higher the value of h, the more accurate this upper bound. Thus, if dst ≤ R − n_index.rad, then, by the triangle inequality, within distance R from n_curr.obj there are at least k objects, and n_curr.obj is not an outlier. In this case the range query is stopped and the next dataset object is considered (lines 13–14). The aforesaid rule is used to prune inliers early. The more densely populated the region the object lies in, the higher the chance that the object is recognized as an inlier by means of this rule. This can be intuitively explained by noticing that the radius of the hyperspheres associated with the objects lying in proximity of the object is inversely proportional to the density of the region. This is the first rule used by the algorithm to recognize inliers. Since other rules will be used to reach the same goal, this one will be called Pruning Rule 1 (PR1 for short). Otherwise, if dst ≤ R, then the array n_index.nn of the neighbor distances of n_index.obj is updated with dst (line 17) and also the array n_curr.nn of the neighbor distances of n_curr.obj is updated with dst (line 20). After having updated the array n_index.nn, if the radius n_index.rad becomes less than or equal to R, then n_index.obj is recognized as an inlier (line 18). These kinds of objects are the proved inliers. When an object becomes a proved inlier, two diametrically opposed strategies can be adopted. According to the first one, the node n_index is removed from INDEX, since it is no longer a candidate outlier (recall that an object is inserted into INDEX if and only if it has not been possible to determine that it is an inlier).
This strategy has the advantage of releasing space as soon as it is not strictly needed and of making the range query search less expensive when its cost is related to the size of INDEX. However, it may deteriorate inlier detection capabilities, since PR1 becomes ineffective. Indeed, if this strategy is used, then the value n.rad associated with each node n stored in INDEX is always +∞; otherwise the node n would be a proved inlier and would have been removed from INDEX. According to the second strategy, the node n_index is maintained in INDEX, since it could be exploited to detect subsequent dataset inliers through PR1. Between the aforementioned two opposite strategies there is also a third, intermediate one, that is, to maintain a fraction of the proved inliers. This strategy has some advantages with respect to the other two. Indeed, even though the second strategy makes PR1 effective, it may introduce a high degree of redundancy: objects often share neighbors with many other dataset objects, since real data is clustered. Thus, it is better to maintain only a portion of the proved inliers. The parameter p_inliers ∈ [0, 1] represents the fraction of proved inliers that will be maintained in INDEX. According to the third strategy, if n_index.rad is greater than R before updating n_index.nn and less than or equal to R after updating n_index.nn, then the node n_index is kept in INDEX with probability p_inliers, while it is removed from


INDEX with probability (1 − p_inliers) (lines 18–19). This pruning rule will be referred to in the following as PR2. The effect of the parameter p_inliers on the size of INDEX and on the ability of recognizing inliers early will be studied in Section 8.1. As for the current dataset object n_curr.obj, if n_curr.rad becomes less than or equal to R, then it is recognized as an inlier. In this case the range query is stopped and the object is reported as an inlier (line 22). This is the third pruning rule of inliers, referred to as PR3 in the following. Finally, if n_curr.obj is not recognized as an inlier after having compared it with all the objects returned by the range search, then it is reported as not an inlier (line 23).

Procedure pruneInliers. The procedure pruneInliers takes as input a dataset object obj. First, it performs a range query search in INDEX with center obj and radius R (line 24). This search returns a superset of the objects n_index.obj of INDEX such that obj lies in the neighborhood of radius R of n_index.obj. Thus, if the distance between n_index.obj and obj is less than or equal to R, then the array of the neighbor distances of n_index.obj is updated with this distance. If n_index.rad becomes less than or equal to R, then n_index.obj becomes a proved inlier and it is removed from INDEX. This terminates the description of the algorithm. The next section studies the memory requirements of the method, that is, the number of dataset objects that are maintained in the INDEX structure.

3. SPATIAL COST ANALYSIS

In this section the spatial cost of DOLPHIN is studied. An upper bound s_up to the size of INDEX is derived and then employed to show that the memory usage of the algorithm corresponds to a small fraction of the dataset.

3.1 INDEX Size Upper Bound

In this section an upper bound to the size of INDEX is determined. Before starting the analysis, the following preliminary definition is needed.

Definition 3.1.
With p we denote the probability that a randomly picked object of DS and a randomly picked object of INDEX are neighbors, that is, the probability that their distance is less than or equal to R.

Note that the parameter p depends on R but not on k. Without loss of generality, in the following assume that the dataset objects are randomly ordered. The upper bound to the size of INDEX will be obtained under the following two assumptions:
(A1) no inlier is recognized through PR1, that is, the information stored in the array n.nn is never exploited; and
(A2) all the proved inliers are maintained in INDEX, that is, p_inliers is set to one, so that PR2 is disabled.


Intuitively, both A1 and A2 are penalizing for the algorithm, as will be confirmed in the following. The upper bound does not take outliers into account. Indeed, it is safe to assume that they form a negligible fraction of the dataset (of the order of or less than 1‰). The following theorem accounts for the upper bound.

THEOREM 3.2. Under assumptions A1 and A2, with probability 0.999 it holds that the size of INDEX is upper bounded by s_up, where

  s_up = (k − 1)/p + δ · √((k − 1)(1 − p))/p + 5,

with δ a positive real number depending on k, p, and the size N of the dataset.

PROOF. First of all, it is necessary to compute the probability that a generic inlier obj finds k − 1 neighbors in an INDEX having size s. This problem can be modeled as a set of independent Bernoulli trials where instances are kept being drawn until k − 1 successes are found. In this setting, a trial consists in the comparison between obj and one of the s dataset objects, whereas a success occurs when the compared object is a neighbor of obj. Let S denote a random variable representing the number of trials until exactly k − 1 successes are achieved. The random variable S follows a negative binomial distribution, hence

  Pr(S = s) = C(s − 1, k − 2) · p^(k−1) · (1 − p)^(s−k+1),   (1)

where C(a, b) denotes the binomial coefficient. The expected value E[S] and the variance var[S] of S are, respectively,

  E[S] = (k − 1)/p,  and  var[S] = (k − 1)(1 − p)/p².

Recall that an object obj is inserted into INDEX if either (i) it has less than k − 1 neighbors in INDEX, or (ii) there does not exist an object obj′ in INDEX such that dist(obj, obj′) + obj′.rad ≤ R. By assumption A1, and since an upper bound to the size of INDEX has to be stated, the latter condition can be safely disregarded. Let p_s = Pr(S ≤ s) be the probability that a dataset object finds k − 1 neighbors in INDEX when INDEX has size s. From what has been stated before, the value p_s represents also the probability that an object will not be inserted into INDEX when INDEX has size s. Thus, an upper bound to the size of INDEX can be obtained by first determining the value s′_up for s such that N − s = 1/(1 − p_s) holds. Indeed, if INDEX has size s, then the number of objects to be considered for insertion in INDEX is at most N − s. The value s′_up is such that the following equation is satisfied:

  Pr(S ≤ s′_up) = 1 − 1/(N − s′_up),   (2)


where the left-hand side is the cumulative distribution function of the negative binomial distribution. In other words, when INDEX has size s′_up, the expected number of objects that will be inserted in INDEX after having examined the remaining (at most) N − s′_up objects is equal to one. Let X_s denote a random variable following a binomial distribution with probability of success equal to 1 − p_s, that is, X_s represents the number of insertions in INDEX assuming that the probability of insertion is held fixed at 1 − p_s. Let Y_s denote a random variable representing the number of insertions in INDEX when INDEX has size s. Since the probability of insertion decreases whenever a novel object is inserted in INDEX, Pr(X_s ≤ y) ≤ Pr(Y_s ≤ y), and thus X_s represents an upper bound for Y_s. In order to statistically bound the size of INDEX, the value y such that Pr(Y_{s′_up} ≤ y) ≥ 0.999 has to be computed. By what has been said, Pr(Y_{s′_up} ≤ y) ≥ Pr(X_{s′_up} ≤ y), and the value of y can be safely determined by requiring that Pr(X_{s′_up} ≤ y) ≥ 0.999. The preceding inequality is satisfied for y = 5. To conclude, s′_up can be formulated as the value of the random variable S which is δ standard deviations above the expected value of S; then

  s_up = s′_up + 5 = E[S] + δ · √(var[S]) + 5 = (k − 1)/p + δ · √((k − 1)(1 − p))/p + 5,

where δ is such that condition (2) is satisfied. Next it will be empirically shown that the value of δ is "small" (less than 7 for datasets of size up to N = 10^9).

In order to use the aforesaid upper bound, the value of the parameter p associated with the input dataset has to be computed. This is accounted for in Section 3.2 and Section 4. Now the value s_up is derived as a function of the parameters p and k. Recall that the cumulative distribution function Pr(S ≤ s) associated with the negative binomial random variable S can be formulated as

  Pr(S ≤ s) = I_p(k − 1, s − (k − 1) + 1) = I_p(k − 1, s − k + 2),

where I_p denotes the regularized incomplete beta function, that is,

  I_p(a, b) = B(p; a, b) / B(a, b),

and B(p; a, b) and B(a, b) denote the incomplete and complete beta function, respectively [Rider 1962]. Hence, the actual value of s′_up can be obtained by substituting the previous expression in Eq. (2) and then choosing s′_up so that the following equality holds:

  I_p(k − 1, s′_up − k + 2) = 1 − 1/(N − s′_up).   (3)
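Under the model of the proof, Eq. (2) can also be solved numerically without the beta-function machinery, by accumulating the negative binomial CDF term by term until the threshold 1 − 1/(N − s) is crossed. A pure-Python sketch; the function name and the sample parameters are ours:

```python
def s_up_eq2(N, k, p):
    """Smallest s satisfying Pr(S <= s) >= 1 - 1/(N - s), where S is
    negative binomial: trials needed for k-1 successes, success prob. p."""
    s = k - 1
    pmf = p ** (k - 1)                     # Pr(S = k-1); assumes k >= 2
    cdf = pmf
    while cdf < 1.0 - 1.0 / (N - s):
        # recurrence: Pr(S = s+1) = Pr(S = s) * (1-p) * s / (s - k + 2)
        pmf *= (1.0 - p) * s / (s - k + 2)
        s += 1
        cdf += pmf
    return s

# epsilon = 0.1% of p, N = 10^6, p = 1%  =>  k = epsilon * N = 10
N, p = 10**6, 0.01
print(s_up_eq2(N, 10, p) / N)  # close to the 0.31% entry of Table I(a)
```

Note that the recurrence avoids recomputing binomial coefficients at each step; for very large k the initial term p^(k−1) underflows and a log-space version would be needed.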


Table I. Values of s_up/N

(a) p = 1%
  ε        N = 10^6   N = 10^7   N = 10^8   N = 10^9
  0.01‰    0.31%      0.16%      0.12%      0.11%
  0.05‰    0.90%      0.62%      0.54%      0.51%
  0.1‰     1.53%      1.17%      1.06%      1.02%
  0.5‰     6.21%      5.37%      5.13%      5.04%

(b) p = 2.5%
  ε        N = 10^6   N = 10^7   N = 10^8   N = 10^9
  0.025‰   0.22%      0.14%      0.12%      0.10%
  0.125‰   0.73%      0.58%      0.54%      0.51%
  0.25‰    1.32%      1.11%      1.06%      1.01%
  1.25‰    5.69%      5.23%      5.13%      5.03%

(c) p = 5%
  ε        N = 10^6   N = 10^7   N = 10^8   N = 10^9
  0.05‰    0.18%      0.12%      0.11%      0.10%
  0.25‰    0.66%      0.55%      0.52%      0.51%
  0.5‰     1.22%      1.07%      1.02%      1.01%
  2.5‰     5.48%      5.16%      5.05%      5.02%

(d) p = 10%
  ε        N = 10^6   N = 10^7   N = 10^8   N = 10^9
  0.1‰     0.15%      0.12%      0.11%      0.10%
  0.5‰     0.61%      0.54%      0.51%      0.50%
  1‰       1.15%      1.05%      1.02%      1.01%
  5‰       5.33%      5.11%      5.04%      5.01%
Table I reports the value of the ratio s_up/N for N in the range [10^6, 10^9] and for various combinations of the parameters p and ε = k/N. In particular, p ranges in [1%, 10%], and ε was set to 0.1% · p, 0.5% · p, 1% · p, and 5% · p. As for the value of the parameter δ, for all the combinations of values considered it ranged approximately from 4.82 to 6.29. It can be noticed that for most combinations of p and ε the ratio s_up/N is approximately equal to the ratio ε/p. This can be explained by exploiting the analysis of Theorem 3.2. Indeed,

  s_up/N |_{k=εN} = (1/N) · [(k − 1)/p + δ · √((k − 1)(1 − p))/p + 5] |_{k=εN}
                  = (εN − 1)/(pN) + δ · √((εN − 1)(1 − p))/(pN) + 5/N
                  = ε/p − 1/(pN) + δ · (1/(p·√N)) · √((ε − 1/N)(1 − p)) + 5/N.   (4)

Note that, for N sufficiently large, the second, third, and fourth terms of the last expression become negligible with respect to the first term. Hence, for large datasets, the upper bound s_up can be approximated by k/p.

3.2 Distribution of the Objects in INDEX

Now we discuss how the objects stored in INDEX are distributed under the assumptions stated in the previous section. This is needed in order to compute the value of the parameter p for the dataset at hand and, hence, the upper bound s_up to the size of INDEX. By assumption A1, DOLPHIN inserts an object into INDEX if INDEX contains less than k − 1 neighbors of the object. Given a dataset object x, define π_k(x) as

  π_k(x) = Pr(x has at least k − 1 neighbors in INDEX).

Hence, the probability Pr(Insert x) of inserting x into INDEX can be formulated


as

  Pr(Insert x) = Pr(X = x) · (1 − π_k(x)),

where X is a random variable distributed according to the dataset population. At the beginning of the execution of the algorithm, the probability of having at least k − 1 neighbors in INDEX is zero for all the dataset objects, as INDEX is empty. As objects arrive, the probability of having less than k − 1 neighbors decreases for all the objects, but the higher the value Pr(X = x), the faster the term π_k(x) approaches one for the object x. After having "saturated" a region Reg of the feature space, that is, after the probability π_k(x) for objects x belonging to Reg is sufficiently close to one, DOLPHIN tends to insert only objects x′ coming from regions Reg′ of smaller probability, that is, such that Pr(X = x′) < Pr(X = x). Clearly enough, this strategy has some advantages: indeed, if objects were always inserted in INDEX according to the data distribution, then in order to have the same value of π_k(x) for objects x belonging to lowly populated regions, the size of INDEX would be much greater, since the number of objects coming from highly populated regions stored in INDEX would be much larger. It can be concluded that for each neighborhood of radius R in the inlier region DOLPHIN will store in INDEX about the same number of objects and, hence, the objects in INDEX can be considered approximately uniformly distributed in the region of the feature space from which inliers come. The previous discussion will be empirically validated in Section 4.3.

4. ANALYSIS FOR KNOWN DISTRIBUTIONS

This section shows how to obtain the value of p, and then of s_up, when the distribution of the population is known (Section 4.1). Moreover, the value s_up is computed for some standard distributions (Section 4.2). As a major result, it is shown that, for well-founded values of the parameters k and R, the upper bound s_up corresponds to a very small fraction of the overall dataset.
The empirical analysis of the derived results (Section 4.3) confirms that the objects in INDEX are uniformly distributed, that the upper bound is correct, and that the number of objects actually accommodated in main memory by the algorithm is even far smaller.

4.1 Derivation of the Parameter p

Distance-based outliers were introduced in Knorr and Ng [1997] as a generalization of statistical outlier definitions. Let us recall the definition of unification between outlier definitions.

Definition 4.1 (Knorr and Ng 1997). The distance-based outlier definition unifies another outlier definition Def if there exist specific values k0 and R0 such that an object is an outlier according to Def if and only if it is a DB(k0, R0)-outlier. It is also said that DB(k0, R0) unifies Def, and vice versa.


Many statistical outlier definitions assume that the distribution f of the dataset is known. In the following, Def(f) denotes the statistical outlier definition associated with the distribution f. Let f be a distribution such that Def(f) is unifiable with the distance-based outlier definition; then k_f and R_f denote the values of the parameters k and R such that DB(k_f, R_f) unifies Def(f). Moreover, ε_f denotes the value k_f/N, where N is the number of dataset objects distributed according to f. Proposition 4.2, reported next, explains how the parameters k_f and R_f can be obtained.

PROPOSITION 4.2. Let f denote a probability density function such that DB(k_f, R_f) unifies Def(f). Then, k_f and R_f are such that, for each object x which is an outlier according to Def(f), the following relationship is satisfied:

    F(x + R_f) − F(x − R_f) ≤ ε_f,

where F is the cumulative distribution function associated with f.

PROOF. The line of reasoning exploited in Knorr and Ng [1997] is employed. The distance-based definition DB(k_f, R_f) returns all and only the outliers according to Def(f) if and only if each inlier (outlier, respectively) has at least (less than, respectively) k_f objects within distance R_f. Thus, given a random variable X distributed according to f and an object x, consider their difference |X − x|:

    N · Pr(|X − x| ≤ R_f) ≤ k_f iff x is an outlier
    ⟹ Pr(x − R_f ≤ X ≤ x + R_f) ≤ ε_f iff x is an outlier
    ⟹ F(x + R_f) − F(x − R_f) ≤ ε_f iff x is an outlier.

The actual values for k_f and R_f can be eventually obtained by enforcing that the last inequality holds for outliers but not for inliers.

In order to identify outliers in statistical distributions, Davies and Gather [1989, 1993] introduced the notion of outlier region. Intuitively, the outlier region is the region to which less than an α fraction of the population belongs. In the following, Ω denotes the whole object space.

Definition 4.3 (Davies and Gather 1989). Let F be an absolutely continuous cumulative distribution function with probability density function f. For any α, 0 < α < 1, the outlier region out_f of F is defined as

    out_f = {x | f(x) < δ(α)},

where δ(α) = sup{δ > 0 | Pr(f(X) < δ) ≤ α} and X has distribution function F.

On the basis of this definition, the following natural statistical definition Def(f) for the probability distribution f can be formulated.

Definition 4.4. Any value x is an outlier according to Def(f) if and only if it lies in out_f.

The inlier region in_f associated with the definition Def(f) is the set of objects classified as inliers by Def(f). Note that for each distribution f, Ω = out_f ∪ in_f.
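For a concrete instance of Definition 4.3: for a symmetric unimodal density such as the standard normal, the outlier region is a pair of tails {x : |x| > t} with Pr(|X| > t) = α, so the threshold t can be found numerically. A small sketch (the bisection bounds are arbitrary choices):

```python
from math import erf, sqrt

def normal_outlier_threshold(alpha, lo=0.0, hi=10.0, iters=60):
    # For a symmetric unimodal f, the outlier region is {x : |x| > t} where t
    # solves Pr(|X| > t) = alpha, i.e. 2 * (1 - Phi(t)) = alpha; bisection search
    Phi = lambda z: 0.5 * (1.0 + erf(z / sqrt(2.0)))
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if 2.0 * (1.0 - Phi(mid)) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

print(round(normal_outlier_threshold(0.0027), 2))  # → 3.0
```

For α = 0.0027 the threshold is the familiar 3σ rule, which is exactly the test recalled in Definition 4.7 below.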


Let U_f denote the random variable uniformly distributed in the inlier region in_f, whose probability density function is φ_f(u) = c_f if u ∈ in_f, and φ_f(u) = 0 otherwise, where c_f is 1/V(in_f) and V(in_f) denotes the volume of the region in_f.

Definition 4.5. Let Def(f) be a statistical outlier definition that unifies DB(k_f, R_f), and let DS be a dataset distributed according to f. Assume that DOLPHIN is executed on DS with parameters k = k_f and R = R_f. Then, p_f denotes the probability that a randomly picked object of DS and a randomly picked object of INDEX are neighbors, assuming that the objects kept in INDEX are distributed according to the probability density function φ_f (based on the analysis of Section 3.2).

Now it is shown how the probability p_f can be obtained once the data distribution f is known.

THEOREM 4.6. Let f be a probability density function. Then the probability p_f is given by

    p_f = c_f ∫_{in_f} (F(u + R_f) − F(u − R_f)) du,    (5)

where F is the cumulative distribution function associated with f.

PROOF. Assume that the objects in INDEX are uniformly distributed in the inlier region in_f. Let X be a random variable distributed according to f; then the probability p_f that a dataset object has a neighbor in INDEX can be computed as

    p_f = Pr(|X − U_f| ≤ R_f)
        = ∫ Pr(U_f − R_f ≤ X ≤ U_f + R_f | U_f = u) φ_f(u) du
        = ∫ (F(u + R_f) − F(u − R_f)) φ_f(u) du
        = c_f ∫_{in_f} (F(u + R_f) − F(u − R_f)) du.

The inlier region in_f depends on the value of α, which can be set to any interesting value in the range (0, 1). However, for standard distributions, well-established statistical tests often directly provide the inlier region in_f and then, implicitly, a suitable value for α.

4.2 Analysis on Some Standard Distributions

Next, the value of the parameter p_f is derived for some standard distributions f, namely the normal (Section 4.2.1), the exponential (Section 4.2.2), and the Laplace (Section 4.2.3) distributions.

4.2.1 Normal Distribution. Here normally distributed data is analyzed. First, a well-known outlier test for normal distributions is recalled.


Definition 4.7 (Barnett and Lewis 1994). Let DS be a set of values that is normally distributed with mean μ and standard deviation σ. Define Def(norm) as follows: an object obj of DS is an outlier if and only if (obj − μ)/σ ≥ 3 or (obj − μ)/σ ≤ −3.

This test clearly provides the inlier region in_norm = [μ − 3σ, μ + 3σ], from which the value α = 0.0027 can be obtained. The following theorem states the relationship between the preceding definition and distance-based outliers.

THEOREM 4.8 (KNORR AND NG 1997). The distance-based outlier definition unifies Def(norm) with parameters ε_norm = 0.0012 and R_norm = 0.13σ.

The values k_norm and R_norm are computed as explained in Proposition 4.2. Now the value p_norm can be obtained.

THEOREM 4.9. The value of the parameter p_norm is 0.0432.

PROOF. By definition of Def(norm), the region in_norm is the interval [μ − 3σ, μ + 3σ], and then c_norm = 1/V(in_norm) = 1/(6σ). To conclude, the probability p_norm is computed according to Theorem 4.6:

    p_norm = c_norm ∫_{in_norm} (F(u + R_norm) − F(u − R_norm)) du
           = (1/(6σ)) ∫_{μ−3σ}^{μ+3σ} (F(u + 0.13σ) − F(u − 0.13σ)) du
           = (1/6) ∫_{−3}^{3} (Φ(z + 0.13) − Φ(z − 0.13)) dz = 0.0432,

where Φ denotes the standard normal cumulative distribution function.

By substituting the values ε_norm = 0.0012 and p_norm = 0.0432 in formula Eq. (3), the following values of sup/N for various values of N are obtained:

    N      10^5    10^6    10^7    +∞
    sup%   3.94%   3.16%   2.91%   2.78%

The last column reports the value ε_norm/p_norm which, according to Eq. (4), represents the value of sup/N for N approaching infinity.

4.2.2 Exponential Distribution. Now, the parameter p_exp is computed when the data is distributed according to an exponential distribution. An outlier test for this distribution can be defined by properly setting the value of α in Definition 4.3, as accounted for in Schultze and Pawlitschko [2002].

Definition 4.10 (Schultze and Pawlitschko 2002). Let DS be a set of values that is exponentially distributed with parameter λ. Define Def(exp) as follows: an object x of DS is an outlier if and only if

    x ≥ −(1/λ) ln α.

Then, a value x is an outlier according to Def(exp) if and only if x belongs to out_exp = [−(1/λ) ln α, +∞).
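As a quick numeric check of the threshold in Definition 4.10 (a sketch; λ = 1/7.143 and α = 0.003 match the exponential dataset used later in Section 4.3, where the inlier region [0, 41.49] is quoted):

```python
from math import log

def exp_outlier_threshold(lam, alpha):
    # Def(exp): x is an outlier iff x >= -(1/lam) * ln(alpha)
    return -log(alpha) / lam

print(round(exp_outlier_threshold(1.0 / 7.143, 0.003), 2))  # → 41.49
```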


THEOREM 4.11. The distance-based outlier definition unifies Def(exp) with parameters ε_exp = (e^0.1 − e^−0.1)·α ≈ 0.2α and R_exp = 0.1·(1/λ).

PROOF. Starting from Proposition 4.2, and by setting R_exp to 0.1·(1/λ) and ε_exp to (e^0.1 − e^−0.1)·α, it follows that

    F(x + 0.1/λ) − F(x − 0.1/λ) ≤ (e^0.1 − e^−0.1)·α
    ⟹ (1 − e^{−λ(x+0.1/λ)}) − (1 − e^{−λ(x−0.1/λ)}) ≤ (e^0.1 − e^−0.1)·α
    ⟹ (e^0.1 − e^−0.1)·e^{−λx} ≤ (e^0.1 − e^−0.1)·α
    ⟹ e^{−λx} ≤ α ⟹ x ≥ −(1/λ) ln α.

Now the value p_exp can be obtained.

THEOREM 4.12. The value of the parameter p_exp is 0.2·(α − 1)/ln α.

PROOF. By definition of Def(exp), the region in_exp is [0, −(1/λ) ln α), and then c_exp = 1/V(in_exp) = −λ/ln α. To conclude, according to Theorem 4.6, the probability p_exp is computed as follows:

    p_exp = c_exp ∫_{in_exp} (F(u + R_exp) − F(u − R_exp)) du
          = (−λ/ln α) ∫_0^{−(1/λ) ln α} (F(u + 0.1/λ) − F(u − 0.1/λ)) du
          = (−λ/ln α) ∫_0^{−(1/λ) ln α} e^{−λu} (e^0.1 − e^−0.1) du
          = (−λ/ln α) (e^0.1 − e^−0.1) [−(1/λ) e^{−λu}]_0^{−(1/λ) ln α}
          = (e^0.1 − e^−0.1)·(α − 1)/ln α ≈ 0.2·(α − 1)/ln α.

Consider for example α = 0.003. By substituting the values ε_exp = 0.2α = 0.0006 and p_exp = 0.0343 in formula Eq. (3), the following values of sup/N for various values of N are obtained:

    N      10^5    10^6    10^7    +∞
    sup%   2.86%   2.10%   1.87%   1.75%

4.2.3 Laplace Distribution. The value of the parameter p_Lap is 0.1·√2·(α − 1)/ln α ≈ 0.142·(α − 1)/ln α. For the formal derivation of this value the reader is referred to Appendix A. Consider for example α = 0.003. By substituting the values ε_Lap = 0.000426 (see Appendix A) and p_Lap = 0.0243 in formula Eq. (3), the following values of sup/N for various values of N are obtained:

    N      10^5    10^6    10^7    +∞
    sup%   3.09%   2.18%   1.91%   1.76%
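The p values derived in this section can be sanity-checked numerically with a few lines of stdlib Python (the midpoint-rule quadrature and step count are illustrative choices):

```python
from math import erf, exp, log, sqrt

def Phi(z):
    # standard normal cumulative distribution function
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def p_norm(R=0.13, steps=20000):
    # Theorem 4.9: (1/6) * integral over [-3, 3] of Phi(z+R) - Phi(z-R),
    # evaluated here by midpoint-rule quadrature
    h = 6.0 / steps
    s = sum(Phi(-3.0 + (i + 0.5) * h + R) - Phi(-3.0 + (i + 0.5) * h - R)
            for i in range(steps))
    return s * h / 6.0

def p_exp(alpha):
    # Theorem 4.12 closed form
    return (exp(0.1) - exp(-0.1)) * (alpha - 1.0) / log(alpha)

def p_lap(alpha):
    # Section 4.2.3 closed form
    return 0.1 * sqrt(2.0) * (alpha - 1.0) / log(alpha)

print(round(p_norm(), 4))           # → 0.0432
print(round(p_exp(0.003), 3))       # → 0.034
print(round(p_lap(0.003), 4))       # → 0.0243
print(round(0.0012 / p_norm(), 4))  # → 0.0278, the 2.78% asymptote above
```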


Fig. 2. Empirical analysis of the algorithm. [Six panels plot, for each object value, the number of nearest neighbors it has in INDEX. Top row: DOLPHIN versus a random sample of the dataset (Sample), for the normal (N=100K, k=120, R=0.13), exponential (N=100K, k=60, R=0.71), and Laplace (N=100K, k=43, R=0.14) distributions. Bottom row: DOLPHIN versus a uniform sample of the inlier region (Uniform), for the same three distributions.]

4.3 Empirical Analysis

The theoretical analysis conducted in the previous section is now validated by considering three datasets composed of 100,000 objects coming from a normal (with mean μ = 0 and standard deviation σ = 1), an exponential (with mean μ = 1/λ = 7.143, and α = 0.003), and a Laplace distribution (with parameters μ = 0 and β = 1, and α = 0.003), respectively. In particular, the distribution of the objects in INDEX is compared with: (i) a random sample of the whole dataset, and (ii) a random sample of a uniform distribution in the inlier region, also referred to as the uniform sample in the following. In these experiments both assumptions A1 and A2 were enforced, that is, both PR1 and PR2 were disabled during the execution of DOLPHIN. Figure 2 shows the results of the experiments. The curves report, for each object value, the number of neighbors it has in INDEX, in the random sample, and in the uniform sample. In the first row, the curves associated with the objects in INDEX at the end of the first scan (solid lines, labeled DOLPHIN) are reported together with the curves associated with a random sample (dashed lines, labeled Sample) having the same size as INDEX. In the second row, the curves associated with the objects in INDEX (solid lines, labeled DOLPHIN) are compared with the curves associated with a uniform sample (dashed lines, labeled Uniform) having the same size as INDEX. For comparison, the dataset outliers have been appended to the uniform sample. As far as the comparison with the dataset random sample is concerned, it is clear that while INDEX allows practically every object belonging to the inlier region to be correctly recognized (since these objects have at least k neighbors in INDEX), a random sample of the same size allows the correct recognition only of objects belonging to the region [−1.45, +1.45] ⊂ [−3, +3] for the normal


Table II. Theoretical upper bound sup and size of INDEX for various values of N (first row)

                              N = 10^5          N = 10^6           N = 10^7
    Normal      sup           3,942 (3.94%)    31,644 (3.16%)    290,832 (2.91%)
                INDEX A1+A2   2,969 (2.97%)    29,106 (2.91%)    290,101 (2.90%)
                INDEX           743 (0.74%)     8,058 (0.81%)     87,943 (0.88%)
                Outliers        286 (0.29%)     2,778 (0.28%)     27,476 (0.27%)
    Exponential sup           2,859 (2.86%)    21,019 (2.10%)    186,676 (1.87%)
                INDEX A1+A2   1,884 (1.88%)    18,378 (1.84%)    182,705 (1.83%)
                INDEX           309 (0.31%)     2,675 (0.27%)     22,460 (0.22%)
                Outliers        323 (0.32%)     3,049 (0.30%)     29,715 (0.30%)
    Laplace     sup           3,097 (3.10%)    21,795 (2.18%)    191,115 (1.91%)
                INDEX A1+A2   1,923 (1.92%)    18,298 (1.83%)    183,726 (1.84%)
                INDEX           373 (0.37%)     2,286 (0.23%)     23,148 (0.23%)
                Outliers        300 (0.30%)     3,266 (0.33%)     30,615 (0.31%)
distribution, to the region [0, 15.5] ⊂ [0, 41.49] for the exponential distribution, and to [−1.84, +1.84] ⊂ [−5.81, +5.81] for the Laplace one. This behavior confirms the discussion of Section 3.2. As for the comparison with the uniform sample, it is evident that, under assumptions A1 and A2, the distribution of the objects in INDEX can be assimilated to a uniform one in the inlier region, further confirming the discussion of Section 3.2. Moreover, the distribution of INDEX appears less sensitive to statistical fluctuations than the uniform sample; as a consequence, the latter recognizes a smaller number of inliers than the former. This behavior is due to the insertion policy of DOLPHIN.

Table II compares the theoretical upper bound sup with the maximum size of INDEX during the first scan of the algorithm. Datasets coming from the same distributions previously introduced and having various sizes, namely 10^5, 10^6, and 10^7, were considered. First, DOLPHIN was executed with the pruning rules PR1 and PR2 disabled (see the rows labeled INDEX A1+A2 in the table), and then the standard DOLPHIN algorithm was executed (see the rows labeled INDEX), that is, with PR1 and PR2 enabled. In the latter executions the parameter pinliers was set to 0.05. The sensitivity of the method to the parameter pinliers will be studied in the Experimental Results section (Section 8). The rows labeled Outliers report the number of outliers found in the datasets. The sizes of INDEX and of INDEX A1+A2 do not include the number of outliers, since sup does not take the outliers into account, as previously stated (Section 3.1). These experiments validate the theoretical upper bound sup derived in Section 3.1, as witnessed by the comparison between the rows labeled sup and the rows labeled INDEX A1+A2. Furthermore, it is clear from the rows labeled INDEX that the actual memory requirements of DOLPHIN are much smaller than those predicted by the upper bound sup: in these experiments, without taking into account the outliers, the size of INDEX corresponds to about 10% of the upper bound sup. Hence, as anticipated, these experiments confirm that assumptions A1 and A2 are very penalizing for the algorithm. Moreover, note that the number of outliers in the dataset coincides with the value of α for which the unifying parameters have been determined, that is,


0.0027 for the normal distribution, and 0.003 for the exponential and Laplace ones. In Section 8 the value p_DS of the parameter p associated with some real datasets DS will be estimated, and the number of objects stored in INDEX by the algorithm will be compared with the upper bound sup computed using p_DS. The analysis on real datasets will further confirm the analysis accomplished here.

5. TEMPORAL COST ANALYSIS

This section studies the temporal cost of the DOLPHIN algorithm. First, the implementation of the INDEX data structure is discussed (Section 5.1), and then the temporal cost of the algorithm is accounted for (Section 5.2).

5.1 Implementation Details

In past years, many efforts have been made to provide efficient algorithms for similarity search in metric spaces, since this kind of operation is useful in a variety of fields (for a complete survey on indexing techniques, refer to Chávez et al. [2001] or to Samet [2005]). These algorithms search a given data collection for objects which are similar, or close, to an input object, also called the query object. There are basically two types of similarity queries: range queries and k-nearest neighbor queries. Range queries, which are of interest here, take as input the query object q and a radius R, and return all the objects of the collection lying within distance R from q. By contrast, k-nearest neighbor queries take as input the query object q and a positive integer k, and return the k objects of the collection which are closest to q. Most of the similarity search approaches proposed in the literature rely on the strategy of building an index, that is, a data structure aimed at reducing the number of distance computations at query time. Basically, these algorithms distinguish between indexing time and query time. Loosely speaking, at indexing time the given collection of objects is partitioned into a set of equivalence classes.
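The two query types described above can be illustrated with a naive linear scan (a sketch on 1-D points with absolute difference as the distance; an index only accelerates these operations, it does not change their semantics):

```python
import heapq

def range_query(data, q, R, dist):
    # return all objects of the collection within distance R of the query q
    return [x for x in data if dist(q, x) <= R]

def knn_query(data, q, k, dist):
    # return the k objects of the collection which are closest to q
    return heapq.nsmallest(k, data, key=lambda x: dist(q, x))

d = lambda a, b: abs(a - b)
pts = [0.0, 1.0, 2.5, 7.0, 7.2]
print(range_query(pts, 7.1, 0.5, d))  # → [7.0, 7.2]
print(knn_query(pts, 0.4, 2, d))      # → [0.0, 1.0]
```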
At query time, the triangle inequality is exploited in order to discard as many irrelevant classes as possible, while the objects of the nondiscarded classes are exhaustively compared with the query object. Any similarity search method proposed in the literature that is suitable for the data at hand can be used to implement the INDEX structure of the DOLPHIN algorithm, such as k-d-trees [Bentley 1975], AESA [Ruiz 1986], R∗-trees [Beckmann et al. 1990], VP-trees [Uhlmann 1991], X-trees [Berchtold et al. 1996], or M-trees [Ciaccia et al. 1997], to cite a few. One of the main techniques, used by a large class of similarity search algorithms, is the pivoting-based one [Chávez et al. 2001]. This is the technique used in the current implementation of the DOLPHIN algorithm. At indexing time, a pivoting-based algorithm selects a certain number of objects, called pivots, and stores in the index all the pairwise distances between the objects of the collection and the pivots. Because of the triangle inequality, given two generic objects x and y, their distance dist(x, y) cannot be smaller than D_p(x, y) = |dist(x, p) − dist(y, p)|, for


any other object p. Hence, the value D_p(x, y) is a lower bound for dist(x, y). With m pivots p1, ..., pm, a better lower bound D(x, y) on the distance dist(x, y) can be obtained as max_{1≤i≤m} D_{pi}(x, y). Note that the value of D(x, y) is computed by exploiting only the precalculated distances between the objects x and y and all the pivots, without the need of computing the distance between x and y. At query time, first of all, the distances between the query object q and all the pivots are computed. Then, the set of candidate objects possibly belonging to the actual query outcome is obtained by selecting only the objects o such that D(q, o) ≤ R. Note that as soon as a term D_{pi}(q, o) greater than R is encountered (1 ≤ i ≤ m), it can be immediately concluded that D(q, o) > R. Unfortunately, by using this strategy some spurious objects may be captured, namely objects o such that D(q, o) ≤ R but dist(q, o) > R. Hence, the true neighbors of q are eventually retrieved by a filtering phase consisting in computing the actual distances between q and each candidate object. Ideally, the set of candidate objects should coincide with the outcome of the query; however, minimizing the number of spurious objects is a hard challenge. Usually, the greater the number of pivots, the smaller the number of spurious objects in the candidate set. A pivot-based index stores the pivot objects, the indexed objects, and the pairwise distances between pivots and indexed objects. Let m be the number of pivots and n the number of indexed objects; then the space required by the index is O(md + n(m + d)), where d denotes both the space required to store a single object and, in the following, the time required to compute a distance. As for the temporal cost of a pivot-based index, it is expressed in terms of distance computations, since this is the most expensive operation that must be accomplished [Chávez et al. 2001].
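The pivot lower bound and the resulting candidate selection can be sketched as follows (1-D objects and hand-picked pivots, purely for illustration):

```python
def pivot_lower_bound(dists_x, dists_y):
    # D(x, y) = max_i |dist(x, p_i) - dist(y, p_i)|: a lower bound on dist(x, y)
    # computed from precalculated object-to-pivot distances only
    return max(abs(dx - dy) for dx, dy in zip(dists_x, dists_y))

def candidate_ids(index_dists, query_dists, R):
    # keep only objects whose lower bound does not exceed R; the filtering
    # phase (computing actual distances) would then discard spurious objects
    return [i for i, dists in enumerate(index_dists)
            if pivot_lower_bound(query_dists, dists) <= R]

pivots = [0.0, 10.0]
objects = [3.0, 9.0]
index_dists = [[abs(o - p) for p in pivots] for o in objects]  # [[3, 7], [9, 1]]
query_dists = [abs(2.0 - p) for p in pivots]                   # [2, 8]
print(candidate_ids(index_dists, query_dists, R=2.0))  # → [0]
```

Object 9.0 is pruned without ever computing its distance to the query, which is exactly the saving the pivot table buys.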
Thus, the querying cost is O(md), corresponding to the cost of computing the distances between the query object and the m pivots. The insertion cost is also O(md), since inserting an object requires computing the distances between the new object and the pivots. The deletion cost is constant, since this operation consists in removing an object from the set of indexed objects. In the implementation of DOLPHIN, the filtering phase is not executed by the query search method, and the method directly returns the set of candidate objects. In particular, candidate objects are returned one at a time according to the following schema: as soon as a candidate object is detected, the range query search method saves its status and returns the candidate object to DOLPHIN to be processed. After having processed the object, DOLPHIN requests the next candidate object from the search method, which resumes its status and continues the search. This schema allows the pruning rules to be applied as soon as each single candidate object is detected. Hence, if PR1 or PR3 is applied, the search stops early, without the need of generating the whole set of candidate neighbors. As noticed earlier, the performance of pivot-based indexes is related to the number of pivots employed. In particular, the larger the number of pivots, the more accurate (and the smaller) the list of candidates returned, but the cost of querying and building the index clearly increases. The number m of pivots used by DOLPHIN is logarithmic in the size of the index, hence m = O(log n).
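In languages with generators, the save-status/resume schema can be sketched directly (a simplification; the names and data model are illustrative, not the paper's code):

```python
from itertools import islice

def candidate_stream(index_dists, query_dists, R):
    # yield candidate ids one at a time; between yields the search state is
    # implicitly saved, and it resumes when the caller asks for the next one
    for i, dists in enumerate(index_dists):
        if max(abs(dq - do) for dq, do in zip(query_dists, dists)) <= R:
            yield i

# the caller can stop early, e.g. once k - 1 neighbors have been confirmed,
# so the whole candidate set is never materialized
index_dists = [[1.0], [2.0], [3.0], [9.0]]
first_two = list(islice(candidate_stream(index_dists, [2.0], R=2.0), 2))
print(first_two)  # → [0, 1]
```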


Thus, the cost of executing the range query search, corresponding to the cost of computing the distances between a dataset object and all the pivots, is upper bounded by O(d log n). As for the insertion of an object obj into the pivot-based index, it requires computing the distances between obj and all the pivots. However, since insertion is always performed by DOLPHIN after the range query search, these distances are already available and, hence, insertion has constant cost.

5.2 Temporal Cost

Now the expected temporal cost of DOLPHIN is analyzed. The temporal cost is measured in terms of distance computations, since computing distances is the dominating operation to be accomplished by a distance-based outlier detection algorithm. The execution time of DOLPHIN is given by the sum of the two terms

    — T1 = N · (Cquery + Nquery · d) + Nins · Cins + Ndel · Cdel, and
    — T2 = N · (Cquery + Nquery · d) + Ndel · Cdel,
where T1 (T2, respectively) accounts for the temporal cost of the first (second, respectively) scan, and:

— N is the number of dataset objects;
— d is the cost of computing a distance;
— Cquery is the temporal cost of a range query search (the cost of selecting the set of candidate neighbors);
— Nquery is the number of candidate neighbors per dataset object returned by the range query search (Nquery · d is the cost of the filtering phase of the range query search);
— Nins (Ndel, respectively) is the number of insertions (deletions, respectively) into the index; and
— Cins (Cdel, respectively) is the cost of an insertion (deletion, respectively) into the index.

First, the cost of the first scan is analyzed. Consider the term N · (Cquery + Nquery · d). It follows from the discussion of Section 3.1 that for large values of N, the maximum size of INDEX can be approximated by k/p. Hence, the term Nquery is upper bounded by O(k/p). From the preceding discussion, Cquery is O(d · log(k/p)), and the term N · (Cquery + Nquery · d) can be upper bounded by O((k/p)·N·d), while the second and third terms are dominated by the first one. Hence, the temporal cost T1 of the first scan can be formulated as O((k/p)·N·d). As for the second scan, both the terms Nquery and Ndel can be upper bounded by O(k/p), and, hence, the temporal cost T2 can also be formulated as O((k/p)·N·d). Despite the fact that the derived worst-case asymptotic costs of the two scans are identical, it is worth pointing out that in practice the actual execution time of the second scan should be much smaller than that of the first one. Indeed, in general, at the beginning of the second scan the size of the index is smaller


than the average size of the index during the first scan, since all the proved inliers contained there are removed, and, furthermore, during the execution of the second scan the size of the index is monotonically decreasing. Overall, the temporal cost of the DOLPHIN algorithm can be formulated as

    O((k/p) · N · d).

Note that this cost coincides with the cost of the algorithm if no indexing technique is used, that is, if INDEX is simply implemented as a list. As a matter of fact, this expression basically assumes that all the objects in the dataset are compared with all the objects in INDEX, and thus it does not take into account that the range query search returns only a subset of the nodes of INDEX. Hence, the worst-case temporal cost of the algorithm does not rely on the indexing technique: the range search may reduce the execution time, but the temporal cost O((k/p)·N·d) is not affected by the effectiveness of the search. Furthermore, the previous analysis does not take into account that DOLPHIN is able to exploit the triangle inequality (through PR1) to prune inliers early, and also that k/p is an upper bound on the actual size of INDEX. Before concluding, three notable cases of the cost depicted earlier are considered.

Nonmetric spaces. DOLPHIN can even be used in nonmetric spaces. In such a case, the pivot-based index is not used, INDEX is implemented as a list of objects, and PR1 cannot be applied. It follows immediately from what was just stated that the temporal cost O((k/p)·N·d) is still valid for nonmetric spaces.

Worst-case temporal cost. Now, the scenario in which the cost of the algorithm degenerates into a quadratic one, namely O(N²·d), is discussed. Since the cost of the algorithm is O((k/p)·N·d), it degenerates to O(N²·d) when the ratio k/p becomes comparable to the dataset size N. Since k = εN, the condition holds when εN/p approaches N, which corresponds to the condition ε ≈ p. Note that, since k/p represents the size of INDEX, when k/p approaches N, INDEX contains a large part of the dataset. Hence, the closer k/p is to N, the closer the value of p is to the probability for an object to have a neighbor in the dataset. Consequently, the closer the value of ε to the value of p, the closer the value of k to the expected number of neighbors of a generic object in the dataset. It can be concluded that if the condition k/p ≈ N holds, then the analysis is not meaningful: since each object is expected to have k neighbors, the discrimination between inliers and outliers is weak. Hence, the worse the algorithm performs, the closer ε is to p, and the more poorly the parameters are set.

Comparison with indexing techniques. Finally, a comparison between the strategy pursued by DOLPHIN and the strategy based on indexing the whole dataset is given. In order to analyze this scenario, the cost of the DOLPHIN algorithm using a certain indexing technique I to implement INDEX is compared with the naive algorithm, referred to as the naive method. The latter works in two steps: First, I


is employed to build an index containing the whole dataset; second, for each dataset object obj, a range query search is performed on the index in order to find the neighbors of obj. Let N be the dataset size, let Cins(s) be the cost of inserting an object into the index data structure having size s, let Cquery(s) be the cost of searching for the neighbors of an object in the index data structure having size s, and, finally, let Nquery(s) be the number of candidates returned by the range query search. The cost of indexing the whole dataset, which corresponds to the first step of the naive method, is O(N · Cins(N)), whereas the second step costs O(N · (Cquery(N) + Nquery(N) · d)). Hence, the total cost of the naive method is

    O(N · Cins(N) + N · (Cquery(N) + Nquery(N) · d)).    (6)
Recall that the cost of DOLPHIN, as shown at the beginning of this section, is

    O(N · Cins(smax) + N · (Cquery(smax) + Nquery(smax) · d)),    (7)

where smax denotes the maximum number of objects stored in INDEX by DOLPHIN. Therefore, it can be concluded that in the worst case of DOLPHIN, namely when DOLPHIN stores all the dataset objects in INDEX, the cost of the naive method and the cost of DOLPHIN are asymptotically the same. It has just been shown that DOLPHIN stores a significant portion of the dataset in INDEX only if the parameters are not meaningful. Moreover, in meaningful cases the INDEX size smax is much smaller than the dataset size N, as confirmed both theoretically (Section 3) and empirically (Section 8). Then, the costs Cins(·) and Cquery(·) and the number of candidates Nquery(·) are much smaller for DOLPHIN (Eq. (7)) than for the naive method (Eq. (6)).

6. FIXED MEMORY ALGORITHM

Even though it has been shown that for meaningful values of the parameters R and k the size of INDEX amounts to a small fraction of the dataset size, in some cases only a limited buffer of main memory may be available. In order to deal with this scenario, this section describes a modification to the basic schema of DOLPHIN. Figure 3 shows the algorithm fixed-memory DOLPHIN. In particular, when a new node has to be inserted into INDEX, the algorithm checks whether the buffer is full, and if so, it removes some inliers from the memory buffer. This is accomplished by the procedure freeBuffer (lines 5–6).

Procedure freeBuffer. This procedure takes care of releasing part of the memory buffer. In particular, it releases a (1 − ν) fraction of the buffer, where ν is a user-provided parameter that will be discussed next. This task is accomplished in two phases (Figure 3, lines 13–23). During the first phase, some proved inliers already stored in INDEX are removed (lines 13–14). Proved inliers are removed until the buffer contains exactly a βν fraction of proved inliers (0 ≤ β ≤ 1), or the used buffer amounts to only a ν fraction of the

4:24



F. Angiulli and F. Fassetti

Fig. 3. Fixed-memory DOLPHIN outlier detection algorithm.

buffer size. If the latter condition is verified, then the procedure terminates its work without executing the second phase, and the first scan of the algorithm is resumed. Otherwise, an inner scan of the dataset is needed to recognize inliers to be removed from INDEX. The entries of the arrays n.nn associated with nodes n stored in INDEX are set to zero. All the proved inliers stored in INDEX at the beginning of the inner scan, which comprise at most the βν percent, are kept in INDEX during the execution of the inner scan. The inner scan starts from the beginning of the dataset file. For each dataset object read during the inner scan, the procedure pruneInliers is executed. The procedure pruneInliers slightly differs from the homonyms procedure employed by DOPHIN, since now it removes from INDEX only those objects that become proved inliers during the execution of the inner scan. If during the inner scan only the ν percent of the buffer is used, then the scan is stopped and the first scan is resumed (lines 19–20). ACM Transactions on Knowledge Discovery from Data, Vol. 3, No. 1, Article 4, Publication date: March 2009.
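The first phase of freeBuffer can be sketched as follows. This is an illustrative Python sketch, not the paper's pseudocode: the list-based INDEX, the Node layout, and the function name are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Node:
    obj: object
    is_proved_inlier: bool = False

def free_buffer(index, buffer_cap, nu, beta):
    """First phase of freeBuffer (illustrative sketch).
    Proved inliers are dropped until either total usage falls to a nu
    fraction of the buffer, or the remaining proved inliers are a beta*nu
    fraction.  Returns True iff the second phase (the inner dataset scan)
    is still needed to release the rest of the space."""
    for node in [n for n in index if n.is_proved_inlier]:
        if len(index) <= nu * buffer_cap:
            return False  # enough space released: resume the first scan
        if sum(n.is_proved_inlier for n in index) <= beta * nu * buffer_cap:
            break  # keep a beta*nu fraction of proved inliers for PR1
        index.remove(node)
    return True  # still too full: the inner scan must classify the rest
```

Note that the loop iterates over a snapshot of the proved inliers, so removing nodes from `index` while iterating is safe.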


Otherwise, the inner scan reaches the end of the file, and then each object stored in INDEX has seen all the dataset objects and, as a consequence, can be recognized either as an outlier or as a proved inlier. Indeed, the inner scan is analogous to the second scan of the algorithm, with the only difference that it is executed on the subset of the candidate outliers collected until the memory buffer becomes full. Recall that, due to the aforementioned strategy, the proved inliers stored in INDEX form at most a βν fraction of the buffer size. All the objects which are not proved inliers can be safely returned in output as outliers and removed from INDEX, whereas the proved inliers are kept, in order to let the PR1 be effective when the outer scan is resumed. By means of this strategy, the freeBuffer procedure is always able to release enough memory for resuming the outer scan. After the execution of freeBuffer, the object ncurr is inserted into INDEX and the first scan is resumed. The algorithm continues according to the strategy detailed in Section 2, but with one minor modification, which is detailed next. Recall that a progressive number, called record identifier, is associated with each dataset object, representing its position in the dataset file. Let b be the record identifier of the last object read during the inner scan, and let a be the record identifier of the first object ncurr.obj which is inserted into INDEX when the first scan is resumed (line 7 of Figure 3). During the inner scan, the objects stored in INDEX (all of which have record identifier smaller than a) have seen all and only their neighbors having record identifier less than or equal to b. Thus, in order to guarantee consistency of the radius nindex.rad associated with nodes nindex already stored in INDEX, the array nindex.nn must not be updated (line 16 of Figure 1) if both conditions nindex.id < a and ncurr.id ≤ b hold.
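The guard can be expressed compactly as follows; this is an illustrative helper, and the function name and argument layout are ours, not the paper's.

```python
def must_update_nn(nindex_id, ncurr_id, a, b):
    """Guard for updating nindex.nn when the first scan is resumed.
    During the inner scan, nodes with record identifier < a already saw
    every neighbor with record identifier <= b, so such pairs must not
    be counted a second time (that would corrupt nindex.nn)."""
    return not (nindex_id < a and ncurr_id <= b)
```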
Next the strategy pursued by fixed-memory DOLPHIN and the rationale underlying the two parameters ν and β are discussed. The parameter ν has a twofold usefulness. First, it must be said that the algorithm may benefit from keeping some objects in INDEX. Indeed, when the first scan is resumed, the proved inliers kept in INDEX may allow the pruning rules PR1 and PR3 to be immediately effective. Second, if the buffer does not have to be completely emptied, then fewer dataset objects have to be read during the inner scan, which, hence, stops earlier. Note that if the buffer had to be emptied, then in order to recognize outliers, the inner scan should always reach the end of the dataset. Clearly enough, the larger the value of ν, the greater the aforesaid benefits, but the sooner the buffer will be filled again. As for the parameter β, it is used to guarantee that at least a certain fraction of the objects stored in INDEX at the end of the procedure freeBuffer is composed of proved inliers. The larger the value of β, the larger the number of unclassified objects (i.e., either outliers or nonproved inliers) that must be removed from the buffer, and, hence, the larger the number of objects that must be read during the inner scan. Nevertheless, in order to maximize the effectiveness of PR1, the great majority of the objects kept in INDEX by the procedure freeBuffer should be proved inliers. As a trade-off, the parameter β is by default set to 0.75. The sensitivity of the algorithm with respect to the buffer size will be analyzed in the experimental results section (see Section 8).


7. RELATED WORK

As already pointed out in this work, distance-based outliers were first introduced by Knorr and Ng [1998] with the aim of overcoming some limitations of statistical tests. Some variants of the original definition have subsequently been introduced in the literature. Next these definitions and related methods are briefly surveyed, and then techniques handling the original definition, which are more closely related to the work done here, are considered. Since the original definition of Knorr and Ng [1998] does not provide a ranking of the outliers, Ramaswamy et al. [2000] introduced the following alternative definition: Given two integers k and n, an object obj is said to be an outlier if fewer than n objects have a higher value for D^k than obj, where D^k denotes the distance from obj to its kth nearest neighbor. The authors also presented two algorithms to detect outliers. The first assumes that the whole dataset is already stored in a spatial index, like the R*-tree [Beckmann et al. 1990], and uses it to compute the distance from each dataset object to its kth nearest neighbor. Pruning optimizations are exploited to reduce the number of distance computations while querying the index. The authors noted that this method is computationally expensive and introduced a partition-based algorithm to reduce the computational cost. The second algorithm first partitions the input points using a clustering algorithm, and then prunes the partitions that cannot contain outliers. Experiments were reported only up to ten dimensions. Subsequently, Angiulli and Pizzuti [2005, 2002], with the aim of taking into account the whole neighborhood of the objects, proposed to rank them on the basis of the sum of the distances from their k nearest neighbors, rather than considering solely the distance to the kth nearest neighbor. They also presented an algorithm, called HilOut, for mining the top-n distance-based outliers according to the introduced definition.
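For concreteness, the two ranking-based scores just described can be computed naively as follows. This is an illustrative O(N^2) sketch with Euclidean distance; the function names are ours, not from the cited papers.

```python
import math

def knn_distances(points, i, k):
    """Sorted distances from points[i] to its k nearest neighbors."""
    dists = sorted(math.dist(points[i], points[j])
                   for j in range(len(points)) if j != i)
    return dists[:k]

def dk_score(points, k, i):
    """Ramaswamy et al.: D^k, the distance to the k-th nearest neighbor."""
    return knn_distances(points, i, k)[-1]

def weight_score(points, k, i):
    """Angiulli and Pizzuti: sum of the distances to the k nearest neighbors."""
    return sum(knn_distances(points, i, k))

def top_n(points, n, score):
    """Indices of the n objects with the largest outlier score."""
    return sorted(range(len(points)), key=score, reverse=True)[:n]
```

For instance, `top_n(pts, 1, lambda i: dk_score(pts, 2, i))` returns the index of the single strongest outlier under the D^k definition.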
The algorithm works with numerical datasets and consists of two phases. The first phase exploits Hilbert space-filling curves in order to project the full-dimensional data multiple times onto the interval [0, 1]. Each successive projection improves the estimate of an object's outlier score in the full-dimensional space and progressively reduces the set of candidate outliers. This phase guarantees an approximate solution within a deterministic factor, with temporal cost linear in the dataset size and at most quadratic in the dataset dimensionality. The second phase calculates the exact solution, with temporal cost at most quadratic in the dataset size and linear in the dataset dimensionality. The three aforementioned definitions are clearly related. In particular, there exist values for the parameters such that the outliers found by using the first definition, that is the original one, are the outliers mined by using the second definition, while the same property does not hold in general for the third one. Nonetheless, it must be pointed out that while the various definitions share the parameter k, the definition provided in Knorr and Ng [1998] requires as a parameter the radius R, which implicitly defines the number of objects that will be returned as outliers, while the definitions provided in Ramaswamy et al. [2000] and in Angiulli and Pizzuti [2002] require instead the number


n of outliers, which implicitly defines a score above which objects have to be considered outliers. For the definition of Ramaswamy et al. [2000] this score coincides with a radius. It must further be said that some distance-based outlier methods have been designed to support only the original definition, while some others support all three of the mentioned definitions. This work deals with the original definition, even if the proposed technique is also compared with a method supporting all three definitions, that is, ORCA, described in the following. Knorr et al. [2000, 1998] presented basically two algorithms. The first one is a block nested loop algorithm that runs in O(dN^2) time, where N is the number of objects and d the dimensionality of the dataset. The second one is a cell-based algorithm that is linear with respect to N, but exponential in d. The latter method is fast only if d is small. On the other hand, the nested loop approach does not scale well with respect to N. Thus, efforts to develop efficient algorithms scaling to large datasets have subsequently been made. For a brief survey of the ORCA, RBRP, and SNIF algorithms the reader is referred to Appendix B.

7.1 Comparison with Related Methods

DOLPHIN is a very efficient distance-based outlier detection algorithm specifically designed for working with disk-resident datasets. It gains efficiency by naturally merging together in a unified schema the three following strategies: (1) a selection policy for the objects to be maintained in main memory, (2) usage of pruning rules, and (3) similarity search techniques. Indeed, it has been shown (see Section 4) that for well-founded combinations of the parameters k and R, the maximum number of objects maintained in memory by the algorithm with the pruning rules disabled corresponds to a small fraction of the dataset.
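As a baseline for the comparisons that follow, the O(dN^2) nested loop scheme mentioned above can be sketched as follows. This is a simplified, fully in-memory sketch under the original definition (fewer than k neighbors within radius R); the actual algorithm of Knorr and Ng processes the dataset in blocks to bound memory and I/O.

```python
import math

def distance_based_outliers(points, R, k):
    """Objects having fewer than k neighbors within distance R.
    Simplified O(d*N^2) in-memory sketch of the nested loop scheme."""
    outliers = []
    for i, p in enumerate(points):
        neighbors = 0
        for j, q in enumerate(points):
            if i != j and math.dist(p, q) <= R:
                neighbors += 1
                if neighbors >= k:
                    break  # k neighbors found: p is certainly an inlier
        if neighbors < k:
            outliers.append(i)
    return outliers
```

The early break on reaching k neighbors is the simplest of the pruning ideas that the more sophisticated algorithms below build upon.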
Moreover, thanks to the three pruning rules (see Section 4.3 and Section 8), and by properly setting the parameter pinliers, the actual amount of main memory employed is considerably reduced, together with the number of distance computations needed. Finally, since main-memory-resident objects are stored in an index data structure (see Section 5.1), neighbors are searched for as efficiently as possible and the number of distance computations is further reduced. DOLPHIN simultaneously achieves linear CPU time performance and linear I/O cost on very large multidimensional disk-resident datasets with a small usage of main memory (see Section 3.1 and Section 5.2). It must be noted that none of the previously introduced methods is able to simultaneously guarantee these two goals. A modification to the basic method (see Section 6) allows DOLPHIN to work when the available buffer of main memory is smaller than its standard requirements, with a reasonable increase of the execution time (see Section 8.9).


The comparison with other methods is considered next. The technique of detecting outliers by directly exploiting existing indexing techniques [Bentley 1975; Beckmann et al. 1990; Berchtold et al. 1996; Böhm et al. 2001; Chávez et al. 2001], in order to search for the kth nearest neighbor of an object, suffers from the drawback that all the dataset objects have to be stored in the index structure. Besides, the approach of computing the k nearest neighbors of each object is not very efficient for outlier detection, since for a lot of objects this task can be avoided by using more clever strategies (see also the theoretical comparison in Section 5.2). Some outlier detection algorithms exploit indexing or clustering [Ramaswamy et al. 2000; Ghoting et al. 2006], but require building the index, or performing the clustering, on the whole dataset and, in some cases, storing the clustered data in main memory. This kind of processing may be impractical on large disk-resident datasets, which must instead be processed by sequentially reading blocks of disk pages. Moreover, the claimed temporal complexity of the algorithm RBRP [Ghoting et al. 2006] is O(N log N · d), that is, strictly superlinear in the dataset size. The algorithm ORCA [Bay and Schwabacher 2003] works directly on disk-resident data and its claimed temporal cost is near linear. The temporal cost of DOLPHIN is also linear with respect to the dataset size (see Section 5), but it greatly reduces the number of distance computations by exploiting both range searching and pruning rules, which are not present in ORCA. As far as the I/O cost of ORCA is concerned, the ORCA algorithm may present a quadratic I/O overhead (the reader is referred to Tao et al. [2006] for an analysis showing that the I/O cost of ORCA may be quadratic in terms of disk pages read).
Moreover, as pointed out by the authors of ORCA, the value b for the buffer size can have a large effect on the computation time of their algorithm: a small value results in more frequent data accesses, slowing down computation, while larger values result in fewer data accesses but can result in slower computation times because cache efficiency decreases. As a matter of fact, the authors of ORCA observed that there exists a buffer size limit b* beyond which the performance of the algorithm gets worse. The value of b* empirically validated is about 1,000. It can be concluded that, differently from DOLPHIN, ORCA is not able to take advantage of the whole main memory available, since enlarging the memory buffer may slow down its execution. As far as the algorithm SNIF presented in Tao et al. [2006] is concerned, the authors show that the I/O cost of their algorithm is low, since the algorithm scans the dataset three times and, with additional memory usage, in some cases only twice, while DOLPHIN always performs two dataset scans. In order to work and to perform three dataset scans, SNIF requires that at least M0 objects can be allocated in main memory (see Eq. (12) for the definition of M0), while in order to hopefully reduce the number of scans to two, the memory must be larger. Otherwise, if the memory available is not enough to accommodate all the objects required by DOLPHIN, which, besides, are expected to be far fewer than those required by SNIF (see the following), the fixed-memory DOLPHIN algorithm can even be used.


Consider the s objects selected by SNIF to be used as partition centroids. First of all, it must be noted that during the execution of the algorithm their number is fixed, since s is a user-provided parameter, while in DOLPHIN each object can potentially be exploited to prune incoming objects through the PR1. Moreover, the PR1 is more powerful than the concept of partition used by SNIF. Second, it follows from the selection policy adopted by SNIF that the s centroids are distributed according to the dataset population. Hence, it is expected to have too many centroids in the most populated regions of the feature space and an insufficient number in the moderately populated regions and in the boundary regions. Advantages of the selection policy of DOLPHIN in terms of the number of objects to be selected have already been shown in Section 3.2 and in Section 4.3. As far as the number of distances to be computed is concerned, let M ≥ M0 be the number of objects that can be accommodated in main memory by SNIF. It follows from the strategy of the algorithm that, up to the first critical moment, the method is quadratic in the buffer size, since O(M^2) distances have to be computed among all the pairs of objects accommodated in the buffer. Next, each incoming object has to be compared with at least all the s centroids in order to update the density of the associated partitions, and, moreover, neighbors of the objects have to be searched for among the other objects stored in main memory. Both operations are accomplished by performing an exhaustive search. By contrast, DOLPHIN exploits range search, and its pruning rules are applied early, without the necessity of assigning the object to a certain partition in order to update the data summary. Finally, it must be taken into account that every time SNIF performs the flushing of the buffer, it moves at most M/2 unclassified objects to a verification file in order to free buffer space.
Hence, the verification file must be added to the memory requirements of the method, and managing this file also increases the I/O cost. In order to correctly classify the objects stored in the verification file, multiple additional dataset scans may be needed.

8. EXPERIMENTAL RESULTS

This section presents experiments performed using the DOLPHIN algorithm. Experiments are organized as follows. First of all, we study the sensitivity of the method to the parameter pinliers (Section 8.1). Section 8.2 describes the characteristics of the datasets employed in subsequent experiments. Next, the course of the size of INDEX (Section 8.3), the sensitivity to the parameter h (Section 8.4), the memory requirements, the execution time, the effectiveness of the pruning rules (Section 8.5), and the sensitivity to the parameters R and k (Section 8.6) are analyzed. Section 8.7 introduces a method to set the parameter R to significant values without trial and error. DOLPHIN is then compared with other outlier detection algorithms (Section 8.8). Afterwards, the sensitivity of the fixed-memory DOLPHIN to the buffer size is studied (Section 8.9). Finally, we study how the curse of dimensionality affects the performance of the algorithm (Section 8.10).




[Figure omitted: three panels — (a) Normal 100K, k = 120, R = 0.13, 286 outliers; (b) Exponential 100K, k = 60, R = 0.71, 323 outliers; (c) Laplace 100K, k = 43, R = 0.14, 300 outliers — each plotting the maximum, average, and final index size after the first scan versus pinliers.]
Fig. 4. Size of INDEX after the first scan for some standard distributions.

8.1 Sensitivity to the Parameter pinliers

In this section the sensitivity of the DOLPHIN algorithm to the parameter pinliers is studied. First of all, three datasets composed of 100,000 objects each, distributed according to a normal, an exponential, and a Laplace distribution, are considered. The parameters k and R employed are those reported in Section 4.3. Figure 4 shows, for various values of pinliers, the maximum size of INDEX during the first scan (solid lines), the average size of INDEX during the first scan (dashed lines), and the size of INDEX at the beginning of the second scan (dotted lines), that is, the size of INDEX at the end of the first scan after having deleted the proved inliers therein contained. The size does not include the number of outliers, which is 286, 323, and 300 for the normal, exponential, and Laplace datasets, respectively. Even though it may be expected that by deleting all the proved inliers (pinliers = 0) the size of INDEX should be smaller, it was observed that, in terms of both the maximum and the average size of the index, the best value of pinliers is approximately in the range [0.01, 0.1]. In particular, it can be observed that the minimum of both the maximum and the average size of INDEX is obtained for values of pinliers close to 0.05. Furthermore, the size of INDEX reaches two maxima, at pinliers = 0 and pinliers = 1. Indeed, when pinliers is set to a very small value, the PR1 becomes less applicable and, then, a large fraction of inliers is not recognized during the first scan. Vice versa, when pinliers is set to a very large value, the number of recognized inliers increases, but a large fraction of them is also retained, so that the size of INDEX globally increases. The number of (nonproved) inliers contained in INDEX at the beginning of the second scan (dotted lines) is inversely proportional to pinliers. Interestingly, for pinliers = 1, INDEX practically does not contain inliers (their number amounts to a few units).
However, even for values of pinliers in the range [0.05, 0.1], the number of proved inliers contained in INDEX at the end of the first scan amounts to a considerable fraction of the size of INDEX. We will further investigate the effect of pinliers on the size of INDEX on several real datasets in Section 8.3. It can be anticipated that the behavior observed there will be similar to that described here.

[Figure omitted: two panels for the Clusters dataset (k = 20, R = 3.0) — (a) index size versus the number of dataset objects read, for pinliers = 0, 0.05, and 1; (b) maximum and average index size versus pinliers.]
Fig. 5. Course of the size of INDEX for the Clusters dataset.

To conclude, it must be noticed that the parameter pinliers may considerably influence the execution time. In general, the smaller the average size of INDEX, the smaller the expected number of objects each dataset object has to be compared with. Clearly, the size of INDEX is not the only factor affecting the execution time, the others being the effectiveness of the range query search and of the pruning rules. Next we analyze the effect of pinliers on the course of the size of INDEX on a highly clustered synthetic dataset, since this study will further shed light on the strategy pursued by DOLPHIN (see Figure 5). The structure of the dataset is intentionally simple in order to make the results easily interpretable. The synthetic dataset, named Clusters, is composed of 20 well-separated uniform clusters in the plane, having 10,000 points each. For R = 3 each object has 10,000 neighbors, that is, all the objects of the cluster it belongs to. Figure 5(a) shows the size of INDEX during the execution of the first scan as a function of the number of objects already seen, for k = 20 and for varying values of the parameter pinliers, that is, pinliers = 0 (solid line), pinliers = 0.05 (dashed line), and pinliers = 1 (dotted line). Note that for this experiment the value of p is 0.05, which in this case coincides with the probability that two randomly picked objects of the dataset are neighbors; the value of sup is k/p = 20/0.05 = 400, and the curves in Figure 5 are correctly upper bounded by sup. For pinliers = 0, after INDEX has accumulated enough objects to recognize inliers, the algorithm deletes the proved inliers from INDEX and forgets what it has already seen. The size of the index thus becomes oscillating, since the process is then repeated. In this dataset, the fluctuation is very pronounced as the objects are strongly clustered.
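As a quick check of the figures just given, the neighbor probability p and the resulting bound sup = k/p for the Clusters dataset can be reproduced. This is an illustrative computation, not from the paper; it assumes well-separated, equally sized clusters, so that "same cluster" and "are neighbors" coincide.

```python
def neighbor_probability(cluster_sizes):
    """Probability that two independently drawn dataset objects are
    neighbors, assuming well-separated clusters and a radius R for which
    the neighbors of an object are exactly the members of its cluster."""
    n = sum(cluster_sizes)
    return sum((c / n) ** 2 for c in cluster_sizes)

# Clusters dataset: 20 equally sized clusters, so p = 1/20 = 0.05 and the
# Section 3.1 bound on the INDEX size is sup = k / p = 20 / 0.05 = 400.
p = neighbor_probability([10_000] * 20)
sup = 20 / p
```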
For pinliers = 1, after INDEX has accumulated enough objects to recognize inliers, that is, 380 objects (19 from each of the 20 clusters), its size stabilizes and subsequent inliers are all recognized as soon as they arrive.

Table III. Datasets Employed

Dataset              Objects    Attributes
Color Histogram       68,040        32
DARPA 1998           499,467        24
Forest Cover Type    581,012        54
Household          1,000,000         3
Landsat              275,465        60
Server               500,000         5
On the contrary, for pinliers = 0.05, after having accumulated in INDEX sufficient objects to recognize inliers, a large portion of the proved inliers is removed, but, since the algorithm has learned the dense regions of the feature space (this information is stored together with the proved inliers), PR1 is applied efficiently and the size of the index stabilizes at a very small fraction of the upper bound sup. Figure 5(b) shows the maximum (solid line) and the average (dashed line) size of INDEX during the first scan of the algorithm as a function of the parameter pinliers. The minimum value of the average size of INDEX is obtained for pinliers slightly greater than zero, since the dataset is strongly clustered, that is, for pinliers approximately equal to 0.025. In this case, the average size is about 21.75, which is very close to the optimum value 20, that is, exactly one object per cluster. Moreover, compare the average size 21.75 with the upper bound k/p = 400. Summarizing, by properly tuning the parameter pinliers, only a small fraction of the dataset is inserted into INDEX. As will also be confirmed in the following, a value for pinliers approximately equal to 0.05 appears to be a good trade-off between the size of INDEX (amount of redundant information) and the execution time (average size of INDEX and effectiveness of the pruning rules). Based on the previous analysis, in the following the parameter pinliers will be set to 0.05 by default, even if in some cases the values 0 and 1 will also be considered.

8.2 Dataset Characteristics

Table III summarizes the datasets used in the experiments illustrated in the next sections. These datasets are briefly described next. Color Histogram contains image features extracted from a Corel image collection.2 DARPA 1998 consists of network connection records of intrusions simulated in a military network environment, from five weeks of training data.3 Forest Cover Type contains forest-cover-type data determined from the U.S.
Forest Service (USFS) Region 2 Resource Information System (RIS) data.4 Household is released by the U.S. Census Bureau and contains the annual expenditure of American families on electricity, gas, and water, as described in Tao et al. [2006]. Landsat contains normalized feature vectors associated with

2 http://kdd.ics.uci.edu/databases/CorelFeatures/CorelFeatures.html.
3 http://www.ll.mit.edu/IST/ideval/index.html.
4 http://kdd.ics.uci.edu/databases/covertype/covertype.html.


[Figure omitted: six panels (pinliers = 0.05) plotting the number of outliers versus the radius R, for three values of k per dataset — Household (k = 200, 600, 1000), ColorHistogram (k = 13, 40, 68), Landsat (k = 55, 165, 275), DARPA 1998 (k = 100, 300, 500), Forest (k = 116, 348, 581), and Server (k = 100, 300, 500; PIVOT index).]
Fig. 6. Number of outliers for various values of R and k.

tiles of a collection of large aerial photos.5 Server is an extract of the KDD Cup 1999 data, as described in Tao et al. [2006]. According to the methodology suggested in Tao et al. [2006], in most of the experiments reported in the following, for the parameter k we will employ a value ranging from 0.02% to 0.1% of the dataset size, while for the parameter R, the value employed will range in the interval [Rmin, Rmax], where Rmin (Rmax, respectively) is the radius corresponding to exactly one outlier (to a number of outliers equal to 0.1% of the dataset size, respectively) when k is set to 0.05% of the dataset size. Figure 6 reports the values of the parameters employed on the datasets Color Histogram, DARPA 1998, Forest Cover Type, Household, Landsat, and Server, together with the corresponding number of outliers.

In order to validate the upper bound computation of Section 3.1 on real-life distributions, the value pDS for the parameter p (see Definition 3.1) associated with a generic dataset DS has been estimated through the following procedure. First of all, the feature space is filled with a d-dimensional grid of objects; say G is this set of objects. For each object g of G, the number n_g of neighbors it has in DS is determined. Let G' be the subset of G composed of the objects g such that n_g > k − 1. The parameter pDS is then obtained as

    pDS = (1 / |G'|) · Σ_{g ∈ G'} (n_g / N).

In order to obtain a reliable estimate, the grid G must be dense. Hence, the number of objects composing the grid was progressively augmented until convergence of the value pDS was achieved. Unfortunately, if d is greater than a few units, the aforesaid procedure becomes ineffective. Hence, we were able to estimate the parameter p only on the Household dataset (d = 3). For k = 600 and R = 2,600.781, the value pHousehold estimated was 0.0746. It must be pointed out that it may be difficult to estimate the parameter p for multidimensional datasets, but this does not

5 http://vision.ece.ucsb.edu.

[Figure omitted: six panels plotting the INDEX size (as a percentage of the dataset) versus the percentage of the dataset read, for pinliers = 0, 0.05, and 1 — first row: Household with R = 2600.780 and k = 200, 600, 1000; second row: Landsat with R = 0.451 and k = 55, 165, 275.]
Fig. 7. Course of the size of INDEX.

perform well on this data. As a matter of fact, in the following it will be shown that for meaningful values of the parameters, the size of INDEX corresponds to a small fraction of the dataset size.

8.3 Course of the Index Size

In this section we study the course of the size of INDEX during the execution of DOLPHIN on some real datasets. Experiments on both a low-dimensional dataset, that is, Household (3 attributes), and a high-dimensional dataset, that is, Landsat (60 attributes), are described. Curves concerning the other datasets exhibited a similar behavior. Figure 7 shows the size of INDEX versus the percentage of objects read from the dataset, for increasing values of the parameter k. As for the parameter h, it is set to 16 (see Section 8.4). Note that in all cases, the growth of the size of INDEX slows down as the percentage of dataset objects already seen increases. On the Household dataset (see Figure 7, first row), for pinliers = 0.05 the maximum size of INDEX is in any case below 0.5%, even for k = 1000. By using k = 1000 and R = 2600 (for these parameters there are more than 2,000 outliers; see Figure 6), when pinliers = 1, INDEX contains at most 1.3% of the dataset objects, while for pinliers equal to 0 the maximum size of INDEX is slightly above 0.5% of the dataset. For pinliers = 0 the size of INDEX is oscillating, while for pinliers = 0.05 it is more stable, confirming the analysis of the previous section. From these curves it is clear that, on this dataset, in terms of INDEX size, setting pinliers to 0.05 is better than setting it to 0, as the former value avoids fluctuations in size. Importantly, these experiments agree with the theoretical analysis of Section 3.1. Indeed, as already stated earlier, for k = 600 and R = 2600, the




Table IV. Execution Time and INDEX Node Size for Various Values of h
(each cell: execution time in seconds / INDEX node size in bytes)

Dataset (R, k)                            h=1              h=2              h=4              h=8              h=16             h=32
Color Histogram (R=0.343, k=40)        25.58 / 133      21.30 / 134      21.48 / 135      19.08 / 138      21.17 / 143      20.69 / 154
DARPA 1998 (R=0.527, k=300)           516.78 / 102     108.11 / 103     100.80 / 105     100.05 / 109      97.30 / 117     100.06 / 133
Forest Cover Type (R=496.826, k=348) 1,861.25 / 222   1,390.28 / 223   1,223.50 / 225   1,105.34 / 229   1,045.67 / 237   1,022.17 / 254
Household (R=2,600.780, k=600)        372.41 / 18       27.14 / 19       27.05 / 21       25.28 / 26       25.06 / 35       27.14 / 53
Landsat (R=0.451, k=165)               55.25 / 245      23.58 / 246      23.33 / 248      22.89 / 252      22.61 / 259      22.97 / 274
Server (R=1,529.404, k=300)           146.95 / 26       54.55 / 27       48.81 / 29       47.17 / 33       46.11 / 41       46.77 / 57
estimated value p_Household for the parameter p associated with the Household dataset is 0.0746, and the associated upper bound to the size of INDEX corresponds to 0.96% of the dataset size. Compare this value to the actual maximum size of INDEX without taking into account the outliers, that is, 8346 − 1033 = 7313 objects, or 0.73% of the dataset size.

As for the Landsat dataset (see Figure 7, second row), the maximum size of INDEX is always directly proportional to the value of the parameter p_inliers. For R = 0.451 and k = 275, in the worst case the size of INDEX reaches 2%, but it is about 1% for p_inliers = 0.05, and about 0.5% for p_inliers = 0.

As already noted, although the size of INDEX is a factor affecting the execution time, there are also other factors: the effectiveness of the range query search and of the pruning rules. It will be shown in the following that the execution time decreases sensibly if a small amount of redundant information is maintained in INDEX (i.e., for p_inliers = 0.05).

8.4 Sensitivity to Parameter h

In this section the sensitivity of DOLPHIN to the parameter h, which is the size of the n.nn array associated with every DBO-node, is analyzed. Table IV shows the results of the experiments. DOLPHIN was executed on each dataset of Table III by using for R the value Rmin and for k the intermediate value obtained by means of the procedure described in Section 8.3. The first column reports the dataset together with the parameter values employed, whereas the other columns report, for h in {1, 2, 4, 8, 16, 32}, the execution time^6 (in seconds) and the size of each INDEX node (in bytes).

From these experiments it is clear that for h = 1 the algorithm performs worst. This can be explained since in this case the pruning rule PR1 cannot be applied. As far as the other values of the parameter h are concerned, it can be observed that for small values of h, the execution time is inversely proportional

6 We used a Pentium IV 3.4 GHz-based machine with 1GByte of main memory and the Windows XP operating system.


to h. However, for values of h approximately greater than 16, the execution time appears to get worse. Indeed, in this case, the additional cost, both in terms of time and space, to be paid to manage a larger n.nn data structure is not rewarded by a corresponding decrease of execution time, since the pruning rule PR1 does not take enough advantage of the finer granularity of the histogram n.nn.

Table IV also shows the size of each INDEX node (in bytes). In particular, in order to store a DBO-node the following amount of memory (in bytes) is allocated by DOLPHIN:

    v = (d + 1)w + ⌈(h · log₂ k) / 8⌉,    (8)

where w denotes the number of bytes needed to store a word (w = 4 in the current implementation), that is, an integer or a floating point number, (d + 1) words are needed to store the fields n.obj and n.id, and the remaining bytes are needed to store the n.nn array, which is composed of h integers in the range [0, k − 1]. Note that when p_inliers is set to zero, only one counter is needed to store the number of neighbors of each object (PR1 is never used), and, hence, h is set to one. In all the subsequent experiments, the value of h has been fixed to 16 in order to optimize the execution time, though good performance is achieved also with smaller values.

8.5 Memory Requirements, Execution Time and Effectiveness of Pruning Rules

Figures 8 and 9 show the size of INDEX (first row), the execution time (second row), and the number of times the pruning rules are applied (third row) during the execution of DOLPHIN on the Household and Landsat datasets, respectively. Concerning the size of INDEX (first row), the figures also show the memory usage (in megabytes) of the algorithm when the maximum value of R is employed. It follows from the discussion of Section 5.1 that the total amount M of main memory (in bytes) required by DOLPHIN is

    M = n(v + mw) + mdw,

where n is the size of INDEX, v is the size of a node computed through Eq. (8), and m is the number of pivots used. In particular, in the implementation of DOLPHIN the number of pivots grows logarithmically with n, and hence m = log₂ n.

From left to right, curves displayed in Figures 8 and 9 are obtained by setting p_inliers to 0, 0.05, and 1, respectively. Note that p_inliers = 0 gives the slowest performance, while p_inliers = 0.05 gives the fastest one. In order to understand this behavior, it is useful to study how frequently the pruning rules PR1, PR2, and PR3 are used (third row). For p_inliers = 0, almost all the objects are pruned by PR2, that is, almost all the objects are inserted into INDEX when read and then deleted from INDEX after having seen k − 1 neighbors. On the contrary,
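The per-node and total memory figures above can be reproduced with a short sketch (the helper names are illustrative; it assumes the ⌈h · log₂ k / 8⌉ reading of Eq. (8) and w = 4):

```python
import math

def node_size(d, k, h, w=4):
    """Bytes per DBO-node (Eq. 8): (d + 1) words for n.obj and n.id,
    plus the bit-packed n.nn array of h counters in [0, k - 1]."""
    return (d + 1) * w + math.ceil(h * math.log2(k) / 8)

def total_memory(n, d, k, h, m, w=4):
    """Total memory M = n(v + m*w) + m*d*w: n INDEX nodes, each extended
    with m pivot distances, plus the m pivot objects themselves."""
    return n * (node_size(d, k, h, w) + m * w) + m * d * w

# Household (d = 3, k = 600): node sizes for h in {1, 2, 4, 8, 16, 32}
print([node_size(3, 600, h) for h in (1, 2, 4, 8, 16, 32)])  # -> [18, 19, 21, 26, 35, 53]
```

For Household the sketch yields the node sizes reported in the corresponding row of Table IV.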




Fig. 8. Household: execution time and effectiveness of pruning rules.

by using p_inliers = 0.05, almost all the inliers are pruned by PR1, but PR2 and PR3 may also be effective, even if the number of objects pruned by the last two rules is more than one order of magnitude smaller. Finally, using a greater value for p_inliers has the effect of increasing the size of INDEX, but without significantly augmenting the number of objects pruned by PR1. Thus, due to the presence of redundant objects, the range query performs worse, and the total execution time increases. Interestingly, by forgetting most of the information already seen, both the spatial and temporal complexities are improved. Among these three values of p_inliers, the best trade-off between space occupancy and execution time is reached for 0.05.

8.6 Sensitivity to Parameters R and k

Figure 10 shows the behavior of DOLPHIN on the datasets Color Histogram, DARPA 1998, Forest, and Server for various values of k and R, and p_inliers = 0.05. The maximum size of INDEX (first column) is always a small fraction of the dataset, and it depends on the dataset and the parameters. The maximum size of INDEX is below 4% for Color Histogram, 1.5% for DARPA 1998, and 2.5% for Server.




Fig. 9. Landsat: execution time and effectiveness of pruning rules.

As for the Forest dataset, by setting k = 581 and R = 496.8, the size of INDEX is 8.5% of the dataset. This is due to the very large number of objects classified as outliers when these parameters are used. As a matter of fact, in this case about 5000 objects of the dataset, that is 1%, are classified as outliers, a very high percentage. However, if the number of outliers is reduced to reasonable values (e.g., to about 1000, or 2‰, for k = 348 and R = 496.8), the maximum size decreases sensibly, to about 5%. The amount of main memory used is in any case below one megabyte, except for Forest, which has greater memory requirements (about 8MB for k = 348 and R = 496.8).

The curves of the execution time (second column) confirm the behavior previously observed on the other datasets. The algorithm performs well even when large values of k and small values of R are employed and, consequently, a considerable number of outliers is found.

As for the pruning rules (third column), PR1 is effective on all combinations of the parameters. For Forest with k = 581, when the radius R decreases, PR3 is applied as frequently as PR1. This indicates that a large fraction of the data lies in relatively sparse regions of the space for that value of R, and explains the very large number of outliers.




Fig. 10. Sensitivity to parameters R and k.

8.7 Choice of Parameter R

If no external information is available about meaningful values for the parameters R and k, they could be set by trial and error. However, in this section we show a method, based on sampling theory, to set the parameter R to significant values without trial and error.

8.7.1 Parameter Estimation Method. The idea underlying the method is to relate the meaningfulness of the parameters to the number of outliers mined through such parameters, and to estimate the number of outliers by exploiting a sample of statistically significant size.


In particular, let N be the number of dataset objects and let α denote the percentage of outliers to be detected. Then, once the parameter ε is fixed, the parameter R such that an α percentage of the dataset objects has fewer than εN neighbors within distance R can be estimated by means of the following procedure.

Procedure DolphinParamEstim
Input: the dataset DS, the sample size n, the parameter ε, the percentage of outliers α
Output: the parameter R
Method:
1: pick a random sample of n objects from DS
2: for each object obj of the sample
3:    compute the distance d_obj from obj to its (εn)-th neighbor in the sample
4: return the value R such that exactly αn objects obj of the sample have d_obj greater than R

In order for the aforesaid procedure to be effective, a meaningful value for the sample size n must be employed. It is now shown how to set the size of the sample in order to have a statistical guarantee that the actual percentage α′ of outliers in the whole dataset, detected when the value of R returned by the procedure is employed, is close to the percentage α of outliers in the sample. With this aim, the following relation must be satisfied:

    Pr[|α′ − α| ≤ ε] > 1 − δ.    (9)
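The estimation procedure can be sketched in executable form as follows (the function name and the generic `dist` argument are illustrative, not the paper's implementation; ties among neighbor distances are broken arbitrarily):

```python
import math
import random

def dolphin_param_estim(ds, n, eps, alpha, dist):
    """Sketch of DolphinParamEstim: estimate the radius R such that
    roughly an alpha fraction of the data has fewer than eps*N
    neighbors within distance R, using a sample of size n."""
    sample = random.sample(ds, n)
    m = math.ceil(eps * n)  # rank of the neighbor whose distance is inspected
    d_obj = []
    for obj in sample:
        # distance from obj to its (eps*n)-th nearest neighbor in the sample
        dists = sorted(dist(obj, other) for other in sample if other is not obj)
        d_obj.append(dists[m - 1])
    d_obj.sort()
    k_out = math.ceil(alpha * n)  # number of sample objects flagged as outliers
    # R separates the k_out largest neighbor distances from the rest
    return d_obj[n - k_out - 1]
```

The quadratic scan over the sample is acceptable here because n is small compared to the dataset size.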

The preceding relation asserts that the probability that the estimation error (the difference between α′ and α) is lower than the error threshold ε is greater than 1 − δ. Clearly, the lower ε and δ, the closer α′ to α. By the well-known Central Limit Theorem, if the sample size n is large enough, then the following relation holds:

    Pr[|α′ − α| ≤ ε] ≈ 2 · Φ( ε√n / √(α(1 − α)) ) − 1,

where Φ denotes the cumulative distribution function of the standard normal distribution. By exploiting the previous result, it can be obtained that relation (9) is satisfied if

    n > (α(1 − α) / ε²) · ( Φ⁻¹(1 − δ/2) )²    (10)

holds.^7 Note that the lower ε and δ, the greater the sample size n, and then the greater the computational cost of the procedure DolphinParamEstim.

8.7.2 Computation of Parameters in Real Datasets. Next, the procedure described in the previous section is tested on the datasets reported in Table III. In all the experiments the fraction α of outliers was set to 3‰, while the fraction ε of neighbors to be considered was set to 1‰.

Let ε = 1‰ and δ = 0.1, so that the estimation error is smaller than 1‰ with probability 0.9, or, equivalently, so that the number of outliers in the dataset

7 For details, the reader is referred to Watanabe [2000].
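The bound of Eq. (10) is easy to evaluate numerically; a small sketch, taking Φ⁻¹ as the inverse CDF of the standard normal (the helper name is illustrative):

```python
import math
from statistics import NormalDist

def sample_size(alpha, eps, delta):
    """Minimum sample size n from Eq. (10): the outlier fraction
    estimated on the sample is within eps of the true fraction alpha
    with probability at least 1 - delta."""
    z = NormalDist().inv_cdf(1 - delta / 2)  # Phi^{-1}(1 - delta/2)
    return math.ceil(alpha * (1 - alpha) * z ** 2 / eps ** 2)

# alpha = 3 per mille, eps = 1 per mille, delta = 0.1
print(sample_size(0.003, 0.001, 0.1))  # -> 8093, the value used below
```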


detected by using the parameters estimated with that sample size is between 2‰ and 4‰ with probability 0.9. Then, by using Eq. (10), the sample size to be employed is n = 8093. The results of the experiments are reported in the following table.

Dataset              R          Actual Outliers    Expected Outliers
Color Histogram      0.35       253 (3.72‰)        204 (3‰)
DARPA 1998           0.53       1,608 (3.22‰)      1,498 (3‰)
Forest Cover Type    601.13     1,258 (2.17‰)      1,743 (3‰)
Household            1,792.18   3,068 (3.07‰)      3,000 (3‰)
Landsat              0.40       894 (3.25‰)        826 (3‰)
Server               1,680.34   1,225 (2.45‰)      1,500 (3‰)

The second column reports the radius returned by DolphinParamEstim, the third column reports the actual number of outliers mined by DOLPHIN by using that radius, and the fourth one reports the expected number of outliers, namely 3‰ of the input dataset size. Note that by using the radius computed by DolphinParamEstim, the estimation error, that is, the difference between the percentage of actual outliers and the percentage of expected outliers, is in all cases smaller than 1‰, the value chosen for ε.

8.8 Comparison with Other Methods

In this section, the comparison of DOLPHIN with the distance-based outlier detection algorithms ORCA^8 and SNIF^9 is described. The algorithms were compared through scalability analysis. These experiments have a twofold usefulness: (i) to validate the theoretical analysis of Section 5.2 concerning the scaling behavior of DOLPHIN, and (ii) to confirm the qualitative comparison of Section 7.1, showing that DOLPHIN scales better than competitor methods.

The parameters of the three algorithms were set to achieve their best performances. As far as the DOLPHIN algorithm is concerned, the parameter p_inliers was set to 0.05. For the ORCA algorithm, the distance to the kth nearest neighbor was used as outlier score, the cutoff value was set to R, and the number of outliers to be found was set to a value greater than the number of outliers in the dataset; the buffer size was set to the optimal value of 1000 recommended by the authors. As far as the SNIF algorithm is concerned, the number s of centroids was set to 1000, while the maximum number M of objects that can be accommodated into the memory buffer was set to 10% of the dataset size, as recommended by the authors.

In the experiments, we distinguished the CPU time from the I/O cost. As for the I/O cost, a database system scenario is taken into account, where the background caching mechanism of the operating system is not available. In order to simulate the I/O time suffered by the system, the number of sequential

8 http://www.isle.org/∼sbay/software/orca/.
9 http://www.cse.cuhk.edu.hk/∼taoyf/paper/kdd06.html.



Fig. 11. Scaling analysis of ORCA, SNIF, and DOLPHIN Algorithm (CPU time).

and random disk page accesses was counted, and then each access was charged with some units of time. In particular, it is assumed that a sequential page access requires 2 milliseconds, whereas a random page access requires 20 milliseconds.^10 An access to the ith disk page is assumed to be sequential if the last disk page accessed by the algorithm is the (i − 1)-th one, and random otherwise.

Figure 11 reports the CPU execution time of DOLPHIN, ORCA, and SNIF on the Household, Landsat, Color Histogram, DARPA 1998, Forest, and Server datasets. In each experiment, the size of the dataset was varied between 10% and 100% of the whole size. The parameter R was fixed to the value Rmin defined in Section 8.2, while the parameter k was set to 0.03% of the whole dataset size. The value for the parameter k was obtained by first selecting the intermediate value used in the experiments of the aforesaid sections, that is, 0.06%, and then halving it, since the algorithms have to be executed on datasets of increasing size. The number of outliers in the whole dataset can be estimated by looking at the curves reported in Figure 6.

Figure 11 shows the results of the experiments concerning the CPU execution times. Interestingly, DOLPHIN scales linearly with respect to the dataset size, thus confirming the analysis of Section 5.2. ORCA also appears to scale nearly linearly, but the slope of its curve is greater than that of the curve of DOLPHIN, hence confirming that the latter algorithm is more efficient, as discussed in Section 7.1. As for SNIF, it seems to scale worse than the other methods, since the slope of its curve often increases with the dataset size. DOLPHIN is the

10 Both the described setting and the employed page access times have been suggested by one of the anonymous reviewers.
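The charging scheme just described can be mirrored in a short sketch (a hypothetical helper, using the 2 ms / 20 ms charges assumed above):

```python
def io_time(page_accesses, seq_ms=2, rand_ms=20):
    """Charge 2 ms for a sequential page access (page i read right after
    page i - 1) and 20 ms for a random one, as in the simulated model."""
    total, last = 0, None
    for page in page_accesses:
        if last is not None and page == last + 1:
            total += seq_ms
        else:
            total += rand_ms  # the first access and every jump are random
        last = page
    return total

# two sequential scans of a 5-page file: one random seek per scan
print(io_time([0, 1, 2, 3, 4, 0, 1, 2, 3, 4]))  # -> 56, i.e. 2 * (20 + 4*2) ms
```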


Fig. 12. Scaling analysis of ORCA, SNIF, and DOLPHIN (I/O time).

fastest method. It outperforms the other methods in all the experiments. For example, on the Forest dataset DOLPHIN took 439 seconds to complete its work, while SNIF required 3280 seconds and ORCA 9591 seconds. As a further example, on the Household dataset, DOLPHIN terminated after 15 seconds, while ORCA took 419 seconds and SNIF 882 seconds.

Figure 12 reports the evaluated I/O time. The diagrams are in logarithmic scale to include the I/O time of ORCA, which is clearly greater than that of the other two methods. It can be noticed that the growth of both DOLPHIN and SNIF is linear with respect to the dataset size. Recall that, with a logarithmic scale on the vertical axis, a polynomially growing curve appears logarithmic. The curves appear to be very close to each other, but SNIF performs worse than DOLPHIN. Indeed, since the gap between the two curves is constant in logarithmic scale, it follows that the slope of the curve associated with DOLPHIN is smaller than that of the curve associated with SNIF. This can be explained by noticing that while DOLPHIN always performs two sequential scans, SNIF performs two or three sequential scans and may also need to access the verification file. As for ORCA, according to the discussion in Section 7.1, it scales quadratically with respect to the dataset size, and hence performs much worse than the competitor methods. The same kind of experiments were executed also for other values of the parameter k and always showed a similar behavior.

It follows from the preceding comparison that both DOLPHIN and SNIF are efficient in terms of I/O cost since they scale linearly, though DOLPHIN scales better than SNIF, while ORCA is the least competitive. Conversely, both DOLPHIN and ORCA are efficient in terms of CPU time, with DOLPHIN performing noticeably better than ORCA, while SNIF is the poorest. Summarizing, DOLPHIN outperforms rival methods both in terms of I/O and CPU cost. The experiments agree with the analysis given in Section 7.1.


Note that, due to the different ways in which they are evaluated, it is not meaningful to sum the CPU time and the I/O time reported before. In fact, maintaining the CPU and I/O time separately helps in understanding algorithm behavior, since, depending on the scenario addressed in the outlier mining task, either the CPU time or the I/O cost (or both) could be the quantity to be optimized. Indeed, in some scenarios the I/O operations could be very expensive, so that the main goal should be to minimize their cost. Conversely, there are other scenarios in which the minimization of the CPU cost is the challenge; for example, when the distance computation is very expensive.

Before concluding, in order to give an intuition about the total cost of the algorithms in the standard setting in which the file system is used to access disk data, the I/O time has also been evaluated by measuring the time employed by the operating system to perform the I/O operations. In this case, the total execution time of DOLPHIN and SNIF almost coincides with the CPU time reported earlier, while that of ORCA presents an overhead with respect to the associated CPU time.

8.9 Fixed Memory DOLPHIN

In this section the behavior of the fixed-memory DOLPHIN algorithm is empirically tested. The algorithm was run by employing the same values for the parameters k and R considered in the preceding section. The parameter p_inliers was set to 0.05. As for the parameter ν, different combinations were tested; the best results were obtained for ν = 10%, and hence this value is employed in the following. The size of the memory buffer was varied between 128kB and 4MB, and the execution time and the amount of main memory actually employed were measured.

Table V reports the results of the experiments. Each entry of the table shows the execution time in seconds, the number of objects allocated by the algorithm into the buffer as a percentage of the dataset size, and their actual number. Note that, examining the table from right to left, the buffer size halves at each column.

The best temporal performance is achieved when the memory buffer is sufficiently large to accommodate all the objects required by the standard DOLPHIN algorithm. In this case, fixed-memory DOLPHIN behaves as standard DOLPHIN, and enlarging the buffer size beyond this limit is not beneficial. It is important to point out that the fraction of dataset objects stored by DOLPHIN into INDEX is always very small. In these experiments the maximum fraction of dataset objects stored in main memory is 2.51%, reached on the Forest dataset, while on the other datasets this fraction is almost always smaller than 1%. However, from Table V it is clear that DOLPHIN may work even if the available memory is smaller than its standard requirements, at the expense of an increase of the elaboration time.

Consider the best buffer size b∗, a generic buffer size b, with b∗ ≥ b, and the associated execution times t∗ and t, respectively, with t ≥ t∗. Interestingly, in

DOLPHIN: Algorithm for Mining Distance-Based Outliers



4:45

Table V. Experimental Results Using Fixed-Memory DOLPHIN
(each cell: execution time in seconds / objects in buffer as % of dataset / their number)

Dataset (R, k)                     128kB                    256kB                    512kB                    1MB                      2MB                      4MB
Color Histogram (R=0.343, k=20)    14.44 / 1.08% / 733      12.92 / 1.59% / 1,079    12.92 / 1.59% / 1,079    12.92 / 1.59% / 1,079    12.92 / 1.59% / 1,079    12.92 / 1.59% / 1,079
DARPA 1998 (R=0.527, k=150)        479.78 / 0.17% / 861     102.64 / 0.34% / 1,684   65.31 / 0.47% / 2,336    65.31 / 0.47% / 2,336    65.31 / 0.47% / 2,336    65.31 / 0.47% / 2,336
Forest (R=496.826, k=174)          2,272.57 / 0.08% / 483   1,632.34 / 0.16% / 959   1,190.98 / 0.33% / 1,897 954.14 / 0.65% / 3,749   893.89 / 1.27% / 7,400   753.83 / 2.51% / 14,603
Household (R=2,600.781, k=300)     17.64 / 0.16% / 1,586    17.64 / 0.16% / 1,586    17.64 / 0.16% / 1,586    17.64 / 0.16% / 1,586    17.64 / 0.16% / 1,586    17.64 / 0.16% / 1,586
Landsat (R=0.451, k=83)            31.30 / 0.16% / 446      19.19 / 0.32% / 886      15.19 / 0.43% / 1,175    15.19 / 0.43% / 1,175    15.19 / 0.43% / 1,175    15.19 / 0.43% / 1,175
Server (R=1,529.404, k=150)        79.28 / 0.33% / 1,656    59.12 / 0.63% / 3,155    22.2 / 0.65% / 3,259     22.2 / 0.65% / 3,259     22.2 / 0.65% / 3,259     22.2 / 0.65% / 3,259

a large range of values for b, the increase of execution time, that is, the factor Δt = t/t∗, is smaller than the decrease of available memory, that is, the factor Δb = b∗/b. For example, on the Landsat (Server, respectively) dataset, for b = 128kB, that is, for Δb = 4, the time increased by a factor Δt = 2 (Δt less than four, respectively). On the Forest dataset, for b = 128kB and assuming b∗ = 4MB, that is, a decrease factor Δb = 32 for the buffer size, the time increases only by a factor Δt ≈ 4. On the DARPA 1998 dataset, for b = 256kB (b∗ = 512kB) the ratio Δt is less than two, while for b = 128kB the time increased by a factor Δt ≈ 8, which is greater than the decrease Δb = 4 of the buffer size.

Note, however, that when the ratio Δt/Δb sensibly deteriorates, the number of objects that can be accommodated into the buffer is close to the value of the parameter k. Clearly, such values for the buffer size are not meaningful.

Summarizing, these experiments have shown that DOLPHIN can be executed even if the available memory is smaller than that required by the standard method. The increase of time is reasonable: provided that the number of objects fitting into the buffer is not very close to the value of the parameter k, the execution time increases less than linearly with the memory buffer reduction. In any case, even with very small buffers, DOLPHIN behaves better than competitor methods tuned to achieve their best performance.

8.10 Curse of Dimensionality

In this section we analyze how the performance of the INDEX data structure and, consequently, of the algorithm, is affected by the curse of dimensionality. Since it is known that the performance of similarity search methods degrades as the dimensionality of the data increases, in order to understand the impact


of the pivoting-based search on the overall execution time of the algorithm, two versions of the DOLPHIN algorithm were compared. The first version is the standard one, used in all the previous experiments, which employs a pivoting-based indexing technique, also called DOLPHIN-pivot in the sequel. The second version, called DOLPHIN-naive, employs a naive index, that is, the INDEX data structure simply stores DBO-nodes; when a range query is submitted, the naive index returns as candidate neighbors all the objects stored there. The following table compares the execution times of the two methods on the datasets previously considered, sorted by increasing number of attributes. The parameter R has been set to the value Rmin defined in Section 8.2, and the parameter k to 0.03% of the dataset size (as already done in previous sections).

Household

Server

DARPA 1998

Color Histogram

Forest

Landsat

d

3

5

24

32

54

60

Pivot Naive

14.89 118.36

17.53 97.36

55.31 99.58

11.70 8.73

439.34 3,026.28

13.06 31.69
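The difference between the two index variants can be sketched as follows (a minimal sketch with hypothetical class and method names, not the paper's actual data structures): the naive index reports every stored object as a candidate, whereas the pivot-based index keeps each object's precomputed distances to a set of pivots and uses the triangle inequality to discard objects whose distance from the query is provably greater than R.

```python
def dist(a, b):
    # Euclidean distance (any metric works for the pivoting rule)
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

class NaiveIndex:
    def __init__(self):
        self.objects = []

    def insert(self, obj):
        self.objects.append(obj)

    def range_query(self, q, R):
        # every stored object is a candidate: one distance computation each
        return [o for o in self.objects if dist(q, o) <= R]

class PivotIndex:
    def __init__(self, pivots):
        self.pivots = pivots
        self.objects = []  # list of (object, distances to the pivots)

    def insert(self, obj):
        self.objects.append((obj, [dist(obj, p) for p in self.pivots]))

    def range_query(self, q, R):
        dq = [dist(q, p) for p in self.pivots]
        result = []
        for obj, dp in self.objects:
            # triangle inequality: |d(q,p) - d(o,p)| <= d(q,o);
            # if this lower bound exceeds R, obj cannot be within R of q
            if max(abs(a - b) for a, b in zip(dq, dp)) > R:
                continue
            if dist(q, obj) <= R:
                result.append(obj)
        return result
```

Both variants return the same answer; the pivot-based one merely skips computing d(q, o) for objects filtered out by the lower bound, which is where DOLPHIN-pivot gains over DOLPHIN-naive when the data is easy for similarity search.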

The DOLPHIN-pivot method behaves better than DOLPHIN-naive in almost all the experiments, independently of the number of attributes of the dataset. It is known that the number of attributes of a dataset is not directly related to the effectiveness of the range query search on that dataset. For example, in Chávez et al. [2001] the concept of intrinsic dimensionality of a dataset is defined, which provides a way to measure how challenging a dataset is for a similarity search method. As a matter of fact, from the preceding table it can be noted that on the Landsat and Forest datasets, DOLPHIN-pivot outperforms DOLPHIN-naive even though they are the highest-dimensional datasets considered, while on the Color Histogram dataset, which has 32 attributes, DOLPHIN-naive is faster than DOLPHIN-pivot.

The different behavior of the algorithms on these datasets can be understood by looking at the experiments reported in Figure 13. In these experiments three datasets were considered: Color Histogram, Landsat, and Isolet (archive.ics.uci.edu/ml/datasets/ISOLET). The last dataset has been chosen in order to include a very high-dimensional one. It is composed of 6,234 objects, each having 617 attributes, and contains information about the alphabet as spoken by some subjects. The attributes include spectral coefficients, contour features, sonorant features, presonorant features, and postsonorant features. For each dataset, the number of attributes has been varied from 1 to the total number of attributes, by projecting the dataset on a randomly selected subset of the attributes, and the execution times of DOLPHIN-pivot and DOLPHIN-naive have been measured.
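The intrinsic dimensionality of Chávez et al. [2001] is ρ = μ²/(2σ²), where μ and σ² are the mean and variance of the pairwise-distance histogram: the larger ρ, the harder the dataset for similarity search, regardless of the number of attributes. A small estimation sketch (sampling pairs instead of computing all pairwise distances is an implementation shortcut, not from the cited paper):

```python
import math
import random

def intrinsic_dimensionality(points, sample_pairs=2000, seed=0):
    """Estimate rho = mu^2 / (2 * sigma^2) from a sample of pairwise distances."""
    rng = random.Random(seed)
    dists = []
    for _ in range(sample_pairs):
        a, b = rng.sample(points, 2)
        dists.append(math.dist(a, b))
    mu = sum(dists) / len(dists)
    var = sum((x - mu) ** 2 for x in dists) / len(dists)
    return mu * mu / (2.0 * var)
```

On uniform data, ρ grows with the number of attributes (distances concentrate around their mean), while clustered data keeps ρ small, which is consistent with pivoting remaining effective on some high-attribute datasets.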




[Figure 13: four panels plotting execution time (in seconds) against the number of attributes for the Pivot and Naive variants, on the Color Histogram dataset (k = 20), the Landsat dataset (k = 276), and the Isolet dataset (k = 6, shown over two attribute ranges, up to 100 and up to 700 attributes).]

Fig. 13. Dimensionality analysis.

In the experiments the parameter R is chosen by means of the procedure described in Section 8.7, so that the number of outliers is about 3‰ of the dataset size for each execution. The parameter k is set to 6, which is 1‰ of the dataset size, for Isolet, and is the same as before for the other two datasets.

On the Color Histogram dataset the relative performance of DOLPHIN-pivot degrades with respect to that of DOLPHIN-naive as the number of attributes increases, even though the former performs better up to about 29 dimensions. Conversely, on the Landsat dataset, DOLPHIN-pivot is always faster than DOLPHIN-naive and the relative performance does not appear to degrade sensibly. Finally, on the Isolet dataset the behavior of the two methods is similar to that observed on Color Histogram. Specifically, DOLPHIN-pivot outperforms DOLPHIN-naive up to about 11 dimensions. Remarkably, for higher dimensionalities, the performance of DOLPHIN-pivot is close to that of DOLPHIN-naive. Indeed, in this case the range query search does not take advantage of the pivot-based indexing technique and reduces to a sequential scan of all the objects stored in the index. DOLPHIN-pivot performs slightly worse, due to the overhead of computing the distances between the query object and all the pivots.

9. CONCLUSIONS

In this article the DOLPHIN algorithm, a distance-based outlier detection method specifically designed to work with disk-resident datasets, has been presented. The I/O cost of the algorithm is very low, since it corresponds to the cost of sequentially reading the input dataset file twice. Both theoretical


justification and empirical validation of the efficiency of the method have been provided. An upper bound on the spatial requirements of the method has been derived. It has been shown that, for meaningful combinations of the parameters, the main memory usage corresponds to a small fraction of the overall dataset. As a major result, based on the notion of unification between outlier definitions and on the concept of an outlier region for a statistical distribution, the upper bound has been theoretically validated on standard distributions. As far as the temporal cost of the algorithm is concerned, it has been shown that DOLPHIN has linear-time performance in the dataset size. DOLPHIN has been qualitatively and empirically compared with state-of-the-art algorithms. Experimental comparison through scalability analysis has shown that DOLPHIN is much more efficient than its competitors, thus confirming the qualitative analysis. In order to deal with the scenario in which the available memory is smaller than the standard requirements of DOLPHIN, its basic schema has been modified and fixed-memory DOLPHIN has been introduced. The experiments have shown that the increase of execution time of fixed-memory DOLPHIN is reasonable. In any case, even with very small buffers, fixed-memory DOLPHIN behaves better than competitor methods. Summarizing, the presented algorithm has been shown to be very fast and able to efficiently handle enormous disk-resident collections of data.
Several aspects of the algorithm are worth further exploration, among them: using a more sophisticated policy of deletion from the index; using indexing techniques other than pivoting (e.g., R-trees on low-dimensional data, or, in general, other high-dimensional indexing methods); applying the algorithm to specific types of data (e.g., strings); improving the usage of pivots (random pivots and a simple logarithmic rule to decide the number of pivots to employ were used here); and studying the behavior of the method in nonmetric spaces.

APPENDICES

A. ANALYSIS ON THE LAPLACE DISTRIBUTION

Next, the parameter pLap is computed for the case in which the data is distributed according to a Laplace distribution.

Definition A.1. Let DS be a set of values that is Laplace distributed with parameters μ and β. Define Def(Lap) as follows: an object x of DS is an outlier if and only if |x − μ| ≥ −β ln α.

THEOREM A.2. A value x is an outlier according to Def(Lap) if and only if x belongs to Ω_Lap.

PROOF. In order to determine the outlier region, values a and b such that (i) f(a) = f(b) and (ii) F(a) + (1 − F(b)) = α must be determined. Since the Laplace distribution is symmetric with respect to the mean value μ, by condition


(i) it holds that 1 − F(b) = F(a), and it suffices to determine the value a such that 2F(a) = α. For x < μ, the Laplace cumulative distribution function is

    F(x) = (1/2) e^{(x−μ)/β}.

As a consequence,

    2F(a) = α  ⟹  e^{(a−μ)/β} = α  ⟹  a = μ + β ln α.

Since the distribution is symmetric, the value of b is μ − β ln α, and hence the result follows.

THEOREM A.3. The distance-based outlier definition unifies Def(Lap) with parameters ε_Lap = (e^{√2/10} − e^{−√2/10}) α/2 ≈ 0.142α and R_Lap = β√2/10.

PROOF. Since the Laplace distribution function is symmetric, for simplicity, only the case x + R < μ is considered. Starting from Proposition 4.2, and by setting R_Lap to β√2/10 and ε_Lap to (e^{√2/10} − e^{−√2/10}) α/2, we obtain

    F(x + β√2/10) − F(x − β√2/10) ≤ (e^{√2/10} − e^{−√2/10}) α/2
    ⟹ (1/2) e^{(x + β√2/10 − μ)/β} − (1/2) e^{(x − β√2/10 − μ)/β} ≤ (e^{√2/10} − e^{−√2/10}) α/2
    ⟹ e^{(x−μ)/β + √2/10} − e^{(x−μ)/β − √2/10} ≤ (e^{√2/10} − e^{−√2/10}) α
    ⟹ e^{(x−μ)/β} ≤ α
    ⟹ x ≤ μ + β ln α.

Given the case considered (x + R < μ), the previous result is valid under the assumption that (μ + β ln α) + R_Lap ≤ μ, that is, that α ≤ e^{−√2/10} ≈ 0.87, which can be safely assumed since reasonable values for α are less than 0.01, α being the fraction of objects to be considered outliers. Now the value pLap can be obtained.

THEOREM A.4.

The value of the parameter pLap is (0.142α − √2/10)/ln α.

PROOF. By definition of Def(Lap) the region Ω_Lap is [μ + β ln α, μ − β ln α], and then c_Lap = 1/V(Ω_Lap) = −1/(2β ln α). To conclude, according to Theorem 4.6, the probability pLap is computed as follows:

    p_Lap = ∫_{Ω_Lap} c_Lap (F(u + R_Lap) − F(u − R_Lap)) du
          = −1/(2β ln α) ∫_{μ+β ln α}^{μ−β ln α} (F(u + β√2/10) − F(u − β√2/10)) du
          = (α/(2 ln α)) (e^{√2/10} − e^{−√2/10}) − √2/(10 ln α)
          ≈ (0.142α − √2/10)/ln α.
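The closed form above can be checked numerically (a verification sketch, not part of the paper; the values of μ, β, and α are arbitrary):

```python
import math

mu, beta, alpha = 0.0, 1.0, 0.01          # arbitrary test parameters
R = beta * math.sqrt(2) / 10              # R_Lap

def laplace_cdf(x):
    # CDF of the Laplace distribution with location mu and scale beta
    if x < mu:
        return 0.5 * math.exp((x - mu) / beta)
    return 1.0 - 0.5 * math.exp(-(x - mu) / beta)

# midpoint-rule integral of F(u + R) - F(u - R)
# over [mu + beta*ln(alpha), mu - beta*ln(alpha)]
a = mu + beta * math.log(alpha)
b = mu - beta * math.log(alpha)
n = 100_000
h = (b - a) / n
integral = sum(laplace_cdf(a + (i + 0.5) * h + R) - laplace_cdf(a + (i + 0.5) * h - R)
               for i in range(n)) * h
p_numeric = integral * (-1.0 / (2.0 * beta * math.log(alpha)))  # c_Lap * integral

# closed form from the proof
p_closed = (alpha * (math.exp(math.sqrt(2) / 10) - math.exp(-math.sqrt(2) / 10)) / 2
            - math.sqrt(2) / 10) / math.log(alpha)

assert abs(p_numeric - p_closed) < 1e-6
```

For α = 0.01 both expressions evaluate to about 0.03; replacing the exponential factor with the 0.142 shorthand gives the stated (0.142α − √2/10)/ln α.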


B. SURVEY OF RELATED METHODS

In the rest of this appendix the block nested loop and cell-based (Section B.1), ORCA (Section B.2), RBRP (Section B.3), and SNIF (Section B.4) algorithms are surveyed; Section 7.1 is devoted to the comparison of DOLPHIN with related methods.

B.1 Block Nested Loop and Cell-Based Algorithms

In Knorr and Ng [1998] the authors present three algorithms. The first one is based on a block-oriented nested loop whose complexity is O(dN^2), where d is the dimensionality and N the size of the dataset. The second algorithm is a cell-based one, whose complexity is linear with respect to N and exponential with respect to d. Finally, the third one is a variant of the cell-based algorithm, expressly designed for disk-resident datasets. It requires at most three passes over the dataset. Next, these three algorithms are detailed.

The first algorithm is called Algorithm-NL. Let B denote the buffer size. The algorithm splits the buffer into two halves, called the first array and the second array. Then, it reads the first dataset block, whose size is B/2, into the first array. For each object obj in the first array, a counter storing the number of objects within distance R from obj is maintained. The counting stops when the number of neighbors exceeds k. First of all, each object obj is compared with the other objects in the first array. Then the algorithm fills the second array with a different dataset block, until either all the first-array objects have at least k neighbors or all the dataset blocks have been considered. At the end of this process, two blocks are stored in the two arrays. Thus, the roles of the two arrays are switched, that is, the second array becomes the first array and vice versa, and the process restarts. The algorithm terminates after all the dataset blocks have been loaded into the first array of the buffer and processed according to the strategy described.

Let b = N/(B/2) be the number of dataset blocks. Each of the b blocks has to be read once during the first iteration. Since at the end of each iteration two blocks are retained in memory, during each succeeding iteration only b − 2 blocks must be read. Therefore, the total number of blocks to be read is b + (b − 1)(b − 2).
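A minimal in-memory sketch of the counting logic underlying Algorithm-NL (hypothetical function names; the block-switching I/O policy and the buffer management are omitted, so this only illustrates the per-object neighbor counting with early termination, plus the block-read count derived above):

```python
def nl_outliers(dataset, R, k, dist):
    """Return the objects having fewer than k neighbors within distance R."""
    outliers = []
    for obj in dataset:
        count = 0
        for other in dataset:
            if other is obj:
                continue
            if dist(obj, other) <= R:
                count += 1
                if count >= k:      # early termination: obj is an inlier
                    break
        if count < k:
            outliers.append(obj)
    return outliers

def blocks_read(b):
    """Total disk blocks read by Algorithm-NL when the dataset spans b blocks."""
    return b + (b - 1) * (b - 2)
```

For instance, with b = 4 blocks, blocks_read returns 10: all 4 blocks in the first iteration, then 2 more in each of the 3 remaining iterations.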
This algorithm has the same complexity as the naive nested loop one, that is, O(dN^2), but reduces the I/O cost.

Furthermore, the authors propose two versions of a cell-based algorithm. The first one is designed to work with memory-resident datasets, whereas the second one deals with disk-resident datasets, and hence aims to minimize the number of passes over the dataset. The cell-based algorithm for memory-resident datasets is called FindAllOutsM. This algorithm partitions the d-dimensional space into cells of side length R/(2√d). The Layer 1 (L1) neighbors of a cell C are the cells adjacent to C, or, more formally,

    L1(C_{x1,...,xd}) = {C_{u1,...,ud} | x_i − 1 ≤ u_i ≤ x_i + 1, C_{u1,...,ud} ≠ C_{x1,...,xd}},

whereas Layer 2 (L2) is defined as

    L2(C_{x1,...,xd}) = {C_{u1,...,ud} | x_i − ⌈2√d⌉ ≤ u_i ≤ x_i + ⌈2√d⌉, C_{u1,...,ud} ∉ L1(C_{x1,...,xd}), C_{u1,...,ud} ≠ C_{x1,...,xd}}.    (11)
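The choice of cell side R/(2√d) can be illustrated concretely (a sketch with hypothetical helper names): two points in the same cell differ by at most R/(2√d) per coordinate, so their distance is at most √d · R/(2√d) = R/2, which is the first property exploited by the algorithm.

```python
import itertools
import math

def cell_of(point, R):
    """Map a point to its cell; cells have side length R / (2*sqrt(d))."""
    d = len(point)
    side = R / (2.0 * math.sqrt(d))
    return tuple(math.floor(x / side) for x in point)

def layer1(cell):
    """L1: the cells adjacent to the given cell (the cell itself excluded)."""
    offsets = itertools.product((-1, 0, 1), repeat=len(cell))
    return [tuple(c + o for c, o in zip(cell, off))
            for off in offsets if any(off)]
```

Similarly, a point in a cell and a point in one of its 3^d − 1 L1 neighbors differ by at most 2 · R/(2√d) per coordinate, hence are at most R apart, while points outside L1 ∪ L2 are guaranteed to be more than R apart.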

The algorithm is based on the following three properties: (i) any pair of objects within the same cell are at most distance R/2 apart; (ii) any object of a generic cell C and any object of L1(C) are at most distance R apart; (iii) any object of a generic cell C and any object of a generic cell C′ ≠ C such that C′ is neither in L1(C) nor in L2(C) are at distance greater than R apart. As a direct consequence of these properties, the following propositions hold: (iv) if there are more than k objects in a cell C, none of the objects in C is an outlier; (v) if there are more than k objects in C ∪ L1(C), none of the objects in C is an outlier; (vi) if there are fewer than k objects in C ∪ L1(C) ∪ L2(C), all the objects in C are outliers.

FindAllOutsM exploits the aforementioned propositions to detect outliers on a cell-by-cell rather than an object-by-object basis. This reduces the number of comparisons, since a large number of nonoutlier objects can be quickly recognized. If none of the preceding propositions can be exploited, the algorithm has to resort to object-by-object processing. The overall time complexity of FindAllOutsM is O(N + m · c · k^2), where c is equal to (2⌈2√d⌉ + 1)^d and m is the number of cells. It is worth noting that both m and c are exponential with respect to d. Hence, this algorithm is fast only for low-dimensional datasets. The authors provide experimental evidence that this algorithm is useful only when the number of dimensions is smaller than 4. The authors also present a version of this algorithm, called FindAllOutsD, designed for dealing with disk-resident datasets.

B.2 ORCA Algorithm

In Bay and Schwabacher [2003] the authors introduce the ORCA distance-based outlier detection algorithm. They modify the simple algorithm based on nested loops, which has a quadratic scaling behavior, by introducing randomization and a simple pruning rule. The authors show that, with these two simple modifications, the nested loop schema achieves superior performance. In particular, they claim that ORCA yields expected near-linear-time performance. ORCA mines the top-n distance-based outliers in the dataset at hand, according to either the definition introduced in Ramaswamy et al. [2000] or the one in Angiulli and Pizzuti [2002]. Optionally, the user may specify a cutoff value for the outlier score. In this case the algorithm recognizes as outliers only those objects whose score is greater than the cutoff. If n is set to an arbitrarily large value, for example, to the number N of dataset objects, and the cutoff is set to R, then ORCA mines the distance-based outliers according to the definition introduced in Knorr and Ng [1998]. Specifically, at each iteration ORCA sequentially reads from disk a block B of dataset objects. Then it scans the dataset from the beginning for building the


list of the nearest neighbors of the objects in B. A score is associated with each object in B. This score can be the distance to the kth nearest neighbor, or the average distance to the k nearest neighbors, depending on the outlier definition employed. If an object in B achieves a score lower than the cutoff, it can be removed from B since it is not an outlier. The cutoff is the maximum between the value specified by the user and the score of the current top-nth candidate outlier. As more blocks are processed, the cutoff increases together with the pruning efficiency of the method. ORCA requires that the dataset is randomized. This operation can be accomplished in linear time and constant main memory with a disk-based algorithm.

In the worst case ORCA requires O(N^2) distance computations and O((N/b) · N) data accesses, where N is the size of the dataset and b is the size of the block in terms of number of objects. The authors prove that in the average case ORCA has near-linear-time performance, since, under certain conditions, the time to process inliers, which are the majority of points, does not depend on N. Indeed, suppose that the goal is to find outliers as the top-n objects whose distance to the kth nearest neighbor is greatest. Consider the number of distance computations needed to process an inlier x. This object will be compared with dataset objects until k neighbors are found within the cutoff distance, in which case it will be discarded since it cannot be an outlier. Then, the expected number of distances to be computed for recognizing x as an inlier is

    E[Y] = Σ_{y=k}^{N} P(Y = y) · y = k / π(x),

where π(x) is the probability that a randomly selected dataset object lies within cutoff distance from x, and Y is the random variable representing the number of trials required until k successes are obtained, which follows a negative binomial distribution. As for the outliers, the number of distances to be computed is N. Typically the parameters are set to return a small number of outliers; thus the term E[Y] is dominant and ORCA achieves a near-linear-time scaling behavior.

B.3 RBRP Algorithm

In Ghoting et al. [2006] an algorithm for mining distance-based outliers, called RBRP (Recursive Binning and Re-Projection), is presented. The aim of RBRP is to find the top-n data points whose distance to their kth nearest neighbor is greatest. The method consists of two phases. During the first one, RBRP partitions the dataset into bins such that points close to each other in the space are likely to be assigned to the same bin. The second phase is basically the ORCA nested loop algorithm adapted to work with the dataset organized in bins. As far as the first phase is concerned, a divisive hierarchical clustering is employed. The procedure selects κ centers at random, where κ is a user-provided parameter, and then assigns each object to the closest center, thus creating κ partitions. Next, the procedure is recursively executed on each partition for at


least a specified number of times and, in any case, until the size of the partition is below a user-defined threshold. After having partitioned the dataset, the objects belonging to the same bin are organized according to the projection along the principal component, that is, the axis of maximal variance. This arrangement accelerates the detection of neighbors when a bin is sequentially scanned. The authors assert that the time complexity of the first phase is O(N log N · d) in the average case, where N is the dataset size and d the number of dimensions.

As far as the second phase is concerned, an extension of the ORCA nested loop algorithm is employed. For each dataset object obj, the bin B which obj belongs to is scanned in order to search for neighbors. If at least k neighbors of obj are found, then obj is an inlier and the search stops. Otherwise, the bin closest to B is considered, and so on, until either k neighbors within the cutoff distance are found or all the bins have been examined. The worst-case time complexity of the second phase is O(N^2), but the authors claim that for a nonoutlier the approximate nearest neighbors are expected to be found in the very same bin, whereas for an outlier point all the bins have to be examined. Since the number of outliers is much smaller than the dataset size, they claim that the second phase scales as O(N · d). Thus, the overall time complexity of RBRP is O(N log N · d).

The key difference with respect to ORCA is that RBRP, by means of the data rearrangement provided by the first phase, is expected to recognize a nonoutlier point in far less time than ORCA, whereas both ORCA and RBRP need to scan the entire dataset for outlier objects. Moreover, the dataset has to be entirely loaded in memory in order to partition it into bins and in order to efficiently access each bin.

B.4 SNIF Algorithm

In Tao et al. [2006] the authors present an algorithm, called SNIF (for ScaN with prIoritized Flushing), that aims to return all outliers while keeping the I/O cost low. They point out that for typical values of the parameters ORCA has quadratic I/O cost, whereas SNIF is able to accomplish the task by scanning the dataset three times, and by using, under certain conditions, a small amount of main memory. Furthermore, they develop a second version of their algorithm that in most cases scans the dataset only twice, but needs additional memory. The proposed algorithm requires that the dataset is randomized and that a metric distance is defined on the dataset objects. SNIF declares as outliers those data points having fewer than k neighbors within distance R.

First of all, the method randomly selects a set S of s dataset objects. The objects in S are employed to build s partitions PA1, ..., PAs of the dataset. In particular, each object in S is the centroid of a partition. A dataset object obj belongs to the partition PA if (i) its distance from the centroid of PA is smaller than R/2 and (ii) obj is nearer to the centroid of PA than to the


centroid of any other partition. Obviously, the union of all these partitions may not cover the dataset, since an object can be farther than R/2 from all the centroids. Since the dataset is randomized, the centroids can be the first s objects of the dataset. Then, the dataset is scanned and, for each partition PA, the number of objects belonging to PA is counted. If this number exceeds k, then PA is labeled as good. All the partitions not labeled as good are labeled as bad.

Next, a second scan of the dataset is performed. If a dataset object obj belongs to a good partition, by the triangle inequality it cannot be an outlier; hence, it is discarded. Conversely, if the object obj either does not belong to any partition or belongs to a bad partition, then it is retained in memory. Thus, the amount of memory required by SNIF increases. Finally, a third scan of the dataset is performed, in order to detect outliers among the objects retained in memory.

The authors provide a qualitative analysis of the memory consumption based on statistical considerations. Let N be the dataset size. For an object o_i, let o_i.n_{≤R/2} denote the number of dataset objects lying within distance R/2 from o_i. Then, the expected number of objects not belonging to any partition is

    Σ_{i=1}^{N} (1 − o_i.n_{≤R/2}/N)^s.
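The first two scans can be sketched as follows (hypothetical names; a sketch of the partition-and-prune idea only, not the full SNIF algorithm with its flushing and verification machinery). The pruning is sound because an object within R/2 of a good partition's centroid is, by the triangle inequality, within R of the more than k other members of that partition:

```python
def snif_first_two_scans(dataset, s, k, R, dist):
    """Return the objects SNIF would retain in memory for the third scan."""
    centroids = dataset[:s]            # dataset assumed randomized

    def partition_of(obj):
        # index of the nearest centroid, provided it is closer than R/2
        i, d = min(enumerate(dist(obj, c) for c in centroids),
                   key=lambda t: t[1])
        return i if d < R / 2 else None

    # first scan: count the population of each partition
    counts = [0] * s
    for obj in dataset:
        p = partition_of(obj)
        if p is not None:
            counts[p] += 1
    good = [c > k for c in counts]

    # second scan: keep only the objects not covered by a good partition
    retained = []
    for obj in dataset:
        p = partition_of(obj)
        if p is None or not good[p]:
            retained.append(obj)
    return retained
```

Only the retained objects need the third, object-by-object, verification scan, which is why SNIF's memory usage is governed by the expectation derived above.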

Let n_dense be the number of dataset objects whose o_i.n_{≤R/2} value is at least k. The probability that a centroid produces a bad partition is 1 − n_dense/N, and hence the expected number of bad partitions is s · (1 − n_dense/N). Since each bad partition contains at most k − 1 objects, the expected number of objects belonging to bad partitions is

    (k − 1) · s · (1 − n_dense/N).

In conclusion, the expected number of objects to be retained in main memory is

    M0 = s + Σ_{i=1}^{N} (1 − o_i.n_{≤R/2}/N)^s + (k − 1) · s · (1 − n_dense/N).    (12)

If the memory available is enough to store M0 objects, then SNIF accomplishes its task in three scans, yielding linear I/O time performance.

As for the second version of the SNIF algorithm, it retains more objects in memory in order to reduce the chances of performing the third scan. Initially, the method scans the dataset and keeps in memory all the objects. For each object obj read from the dataset, a counter obj.n stores the number of neighbors of obj encountered in the dataset so far. Say M ≥ M0 is the number of objects that can be accommodated in main memory. When the memory becomes full, an event called the critical moment, SNIF performs a flushing, that is, it removes M/2 objects from memory. To this aim, SNIF associates with each object a priority designed to reflect the likelihood that the object is an outlier. First of all, for each object obj stored in memory, the number of neighbors in the whole dataset is estimated as obj.n · N/M, where N is the dataset size. Among


the objects whose estimated number of neighbors is at least k, a set S of s objects is selected. These objects are employed as partition centroids. For each partition PA, PA.den denotes the number of objects belonging to it. The number of objects in the whole dataset belonging to each partition PA is estimated as PA.den · N/Nseen, where Nseen is the number of dataset objects already scanned. The partitions whose estimated density exceeds k are called good, whereas the other ones are called bad. Then, objects whose neighbor counter is greater than k are assigned the lowest priority. Objects that belong to a good partition are assigned a lower priority than objects belonging to a bad partition. Objects that do not belong to any partition have the highest priority.

Objects are then sorted in ascending order of priority, and the first M/2 objects are either discarded, if their counter is greater than k, or appended to the verification file otherwise. After the first critical moment and the first flushing, the scan of the dataset continues. For each object obj, SNIF computes the number of neighbors of obj by comparing it with all the memory-resident objects, and also attempts to further increase this number by exploiting partitions. Whenever the memory again becomes full, the algorithm performs a new flushing. At the end of the dataset scan, SNIF discards both the objects whose neighbor counter is at least k and the objects belonging to good partitions.

Next, the verification file is read. For each object obj in the verification file, if obj belongs to a good partition then it is discarded; otherwise obj is stored in memory. If the memory becomes full or the verification file is exhausted, a dataset scan is performed in order to classify all the memory-resident objects. At the end of the scan, if the verification file still contains objects, the aforesaid process is repeated.

REFERENCES

AGGARWAL, C. C. AND YU, P. 2001. Outlier detection for high dimensional data.
In Proceedings of the International Conference on Management of Data (SIGMOD'01).
ANGIULLI, F. AND FASSETTI, F. 2007. Very efficient mining of distance-based outliers. In Proceedings of the International Conference on Information and Knowledge Management (CIKM), 791–800.
ANGIULLI, F. AND PIZZUTI, C. 2002. Fast outlier detection in large high-dimensional data sets. In Proceedings of the International Conference on Principles of Data Mining and Knowledge Discovery (PKDD'02), 15–26.
ANGIULLI, F. AND PIZZUTI, C. 2005. Outlier mining in large high-dimensional data sets. IEEE Trans. Knowl. Data Eng. 2, 17, 203–215.
ARNING, A., AGGARWAL, C., AND RAGHAVAN, P. 1996. A linear method for deviation detection in large databases. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD'96), 164–169.
BARNETT, V. AND LEWIS, T. 1994. Outliers in Statistical Data. John Wiley & Sons.
BAY, S. D. AND SCHWABACHER, M. 2003. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD'03).
BECKMANN, N., KRIEGEL, H.-P., SCHNEIDER, R., AND SEEGER, B. 1990. The R*-tree: An efficient and robust access method for points and rectangles. In Proceedings of the SIGMOD Conference, 322–331.
BENTLEY, J. 1975. Multidimensional binary search trees used for associative searching. Commun. ACM 18, 509–517.


BERCHTOLD, S., KEIM, D., AND KRIEGEL, H.-P. 1996. The X-tree: An index structure for high-dimensional data. In Proceedings of the Conference on Very Large Databases (VLDB), 28–39.
BÖHM, C., BERCHTOLD, S., AND KEIM, D. 2001. Searching in high-dimensional spaces: Index structures for improving the performance of multimedia databases. ACM Comput. Surv. 33, 3, 322–373.
BREUNIG, M. M., KRIEGEL, H., NG, R., AND SANDER, J. 2000. LOF: Identifying density-based local outliers. In Proceedings of the International Conference on Management of Data (SIGMOD'00).
CHÁVEZ, E., NAVARRO, G., BAEZA-YATES, R., AND MARROQUÍN, J. 2001. Searching in metric spaces. ACM Comput. Surv. 33, 3, 273–321.
CIACCIA, P., PATELLA, M., AND ZEZULA, P. 1997. M-tree: An efficient access method for similarity search in metric spaces. In Proceedings of the International Conference on Very Large Databases (VLDB), 426–435.
DAVIES, L. AND GATHER, U. 1989. The identification of multiple outliers. Tech. rep. 89/1, Department of Statistics, University of Dortmund.
DAVIES, L. AND GATHER, U. 1993. The identification of multiple outliers. J. Amer. Statist. Assoc. 88, 782–792.
ESKIN, E., ARNOLD, A., PRERAU, M., PORTNOY, L., AND STOLFO, S. 2002. A geometric framework for unsupervised anomaly detection: Detecting intrusions in unlabeled data. In Applications of Data Mining in Computer Security. Kluwer.
GHOTING, A., PARTHASARATHY, S., AND OTEY, M. 2006. Fast mining of distance-based outliers in high-dimensional datasets. In Proceedings of the SIAM International Conference on Data Mining (SDM'06).
HAWKINS, D. 1980. Identification of Outliers. Monographs on Applied Probability and Statistics. Chapman & Hall.
JIN, W., TUNG, A., AND HAN, J. 2001. Mining top-n local outliers in large databases. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'01).
KNORR, E. AND NG, R. 1997. A unified approach for mining outliers.
In Proceedings of the IBM Centre for Advanced Studies Conference (CASCON), 219–222.
KNORR, E. AND NG, R. 1998. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the International Conference on Very Large Databases (VLDB'98), 392–403.
KNORR, E. AND NG, R. 1999. Finding intensional knowledge of distance-based outliers. In Proceedings of the International Conference on Very Large Databases (VLDB'99), 211–222.
KNORR, E., NG, R., AND TUCAKOV, V. 2000. Distance-based outliers: Algorithms and applications. VLDB J. 8, 3-4, 237–253.
LAZAREVIC, A., ERTÖZ, L., KUMAR, V., OZGUR, A., AND SRIVASTAVA, J. 2003. A comparative study of anomaly detection schemes in network intrusion detection. In Proceedings of the SIAM International Conference on Data Mining.
PAPADIMITRIOU, S., KITAGAWA, H., GIBBONS, P., AND FALOUTSOS, C. 2003. LOCI: Fast outlier detection using the local correlation integral. In Proceedings of the International Conference on Data Engineering (ICDE), 315–326.
RAMASWAMY, S., RASTOGI, R., AND SHIM, K. 2000. Efficient algorithms for mining outliers from large data sets. In Proceedings of the International Conference on Management of Data (SIGMOD'00), 427–438.
RIDER, P. 1962. The negative binomial distribution and the incomplete beta function. The Amer. Math. Monthly 69, 4, 302–304.
RUIZ, E. V. 1986. An algorithm for finding nearest neighbours in (approximately) constant average time. Pattern Recogn. Lett. 4, 3, 145–157.
SAMET, H. 2005. Foundations of Multidimensional and Metric Data Structures (The Morgan Kaufmann Series in Computer Graphics and Geometric Modeling). Morgan Kaufmann.
SCHULTZE, V. AND PAWLITSCHKO, J. 2002. The identification of outliers in exponential samples. Statistica Neerlandica 56, 1, 41–57.


TAO, Y., XIAO, X., AND ZHOU, S. 2006. Mining distance-based outliers from large databases in any metric space. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD'06), 394–403.
UHLMANN, J. K. 1991. Satisfying general proximity/similarity queries with metric trees. Inf. Process. Lett. 40, 4, 175–179.
WATANABE, O. 2000. Simple sampling techniques for discovery science. TIEICE: IEICE Trans. Commun./Electron./Inf. Syst.

Received February 2008; revised September 2008; accepted November 2008
