FACULTY OF ENGINEERING
THESIS SUBMITTED FOR THE PROGRAMME MASTER OF ARTIFICIAL INTELLIGENCE
ACADEMIC YEAR 2005-2006

KATHOLIEKE UNIVERSITEIT LEUVEN

CLUSTERING MAPS

Wannes Meert
Promotor: Prof. dr. ir. G. JANSSENS
Daily leader: Remko Tronçon

Acknowledgements

I would like to express my thanks to everyone who has helped me create this thesis. First of all, I want to thank my parents for giving me the chance to take an extra year in Artificial Intelligence and for their support. A word of thanks also goes to all my friends, especially De Bizons. And I certainly do not want to forget my girlfriend, Bieke.
I would also like to thank the people from the department of Computer Science: my promotor, prof. G. Janssens, for letting me create this thesis; and my daily leader, Remko Tronçon, for giving me valuable advice and tips, pointing me in the right direction.

Wannes Meert
Leuven, May 2006

Contents

1 Introduction
  1.1 Introduction
  1.2 Purpose
  1.3 Clustering
  1.4 Prototype
  1.5 Applications
  1.6 Overview

2 Cluster analysis
  2.1 Introduction
  2.2 Reasons for cluster analysis
  2.3 What is a cluster?
  2.4 Types of cluster analysis
  2.5 Proximity measure
  2.6 Summary

3 Clustering techniques
  3.1 Introduction
  3.2 K-means
    3.2.1 The basic K-means algorithm
    3.2.2 Time and space complexity
    3.2.3 Strengths and weaknesses
  3.3 Hierarchical clustering
    3.3.1 Agglomerative Hierarchical Clustering Algorithm
    3.3.2 Time and space complexity
    3.3.3 Strengths and weaknesses
  3.4 DBSCAN
    3.4.1 Principle
    3.4.2 DBSCAN algorithm
    3.4.3 Time and space complexity
    3.4.4 Strengths and weaknesses
  3.5 Comparison
    3.5.1 K-means
    3.5.2 Hierarchical clustering
    3.5.3 DBSCAN
  3.6 Summary

4 DBSCAN and R-tree
  4.1 Introduction
  4.2 DBSCAN algorithm
    4.2.1 Density-based notion of clusters
    4.2.2 The algorithm
  4.3 R-tree index structure
    4.3.1 Tree structure
    4.3.2 Searching the tree
    4.3.3 Add a point to the tree
  4.4 Summary

5 Web-based clustering application
  5.1 Introduction
  5.2 Interface level
  5.3 Algorithmic level
  5.4 Index level
  5.5 Summary

6 Conclusions
  6.1 Introduction
  6.2 Prototype
  6.3 Future work

A Source code

Chapter 1

Introduction

1.1 Introduction

When indicating interesting points on a map, one often gets high concentrations of points in certain regions of the map (typically around big cities). If an application provides such a map to a user, it should bear in mind that it is hard for a person to select exactly one point in such a concentrated zone. Therefore, these busy zones are typically grouped (by hand) into one big zone. The aim of this thesis is to develop a tool to automatically detect and group high-concentration zones on annotated maps to create a map which is easy to use.

1.2 Purpose

The main purpose of this thesis is to enhance the ease of use when searching for a point on a map and clicking on it. The main focus is on the attempt to remove the clutter of markers in regions with a high concentration of markers. An example can be seen in Figure 1.1. The bulb-like markers indicate geographical locations of people. Clicking on a bulb gives more information about the person at that location. In the figure there are three cluttered regions where it is difficult to click on the right bulb.

1.3 Clustering

Cluttered markers occur in regions with a high density of markers. If we can locate these regions and their respective markers, the representation of these markers can be changed into something more user-friendly. It turns out that searching for regions with lots of points can be done by means of clustering techniques. To find a clustering method suited for our problem, a study of different clustering techniques is made. The best suited method has been investigated in detail and implemented to create a prototype.


Figure 1.1: World map with cluttered markers

1.4 Prototype

At the end of this thesis a prototype application is proposed. This prototype shows a map of the world with markers on interesting locations, taking into account the cluttered markers. For this, clustering is used to replace the markers in high-density regions by a semi-transparent rectangle. It is possible to zoom in on such a rectangle to view the individual markers. An example is shown in Figure 1.2.

1.5 Applications

This technique for creating clutter-free maps can be used to display interesting points on a map of some geographical location, as stated before. Some examples are: showing all geographical locations of Jabber [Jabber Software Foundation, 2006] users, like on [Epigoon, 2006] [Butterfat LLC, 2006]; showing all restaurants in the United States of America [toEat.com, LLC, 2006]; or indicating real estate in Great Britain [OnOneMap Limited, 2006]. This shows that a wide variety of applications is possible.

1.6 Overview

The remainder of this thesis is divided into five chapters.


Figure 1.2: World map with clustered markers

Chapter 2 is an introduction to clustering. The concept of a cluster is not very well defined and differs according to the problem to solve. Some characteristics are examined in detail to give an insight into these different clusters and clustering techniques.
Chapter 3 is a study of three different clustering techniques: K-means clustering, hierarchical clustering and density-based clustering. A comparison between these methods is made to choose the best one for the problem stated in this thesis.
Chapter 4 gives a study of our method of choice: DBSCAN. To enhance performance, an index structure has been used to do a spatial query on the points: the R-tree index structure.
Chapter 5 explains the implementation of a prototype web application. This web application shows markers on a map, taking into account the high-density regions (clusters). The technologies used are Google Maps, Javascript, PHP and MySQL.
Chapter 6 is a final conclusion of the work done in this thesis.

Chapter 2

Cluster analysis

An intelligent being cannot treat every object it sees as a unique entity unlike anything else in the universe. It has to put objects in categories so that it may apply its hard-won knowledge about similar objects encountered in the past, to the object at hand.
Steven Pinker, How the Mind Works, 1997

2.1 Introduction

As pointed out in Chapter 1, we want to find regions in a map with a high concentration of markers. In a general sense, this is analyzing two-dimensional data or, even more generally, multivariate data. This chapter is an introduction to the techniques used to analyze such multivariate data, especially for finding groups of objects within the data that possess the same characteristics. This is called cluster analysis or clustering; other names are also used in the literature depending on the field of research: Q-analysis, typology, grouping, clumping, classification, numerical taxonomy and unsupervised pattern recognition.
The idea of sorting similar things into categories is quite primitive, since early humans must have been able to realize that many individual objects shared certain properties. They probably were not able to remember every animal or plant they had seen, but they knew which kinds of animals were dangerous, which plants were poisonous, and so on. As humans are not able to memorize every detail of every object encountered, they attempt to group similar objects. In our mind a generalized and more compact version of all the objects in the world is created. This explains why it is possible to recognize a dog as a dog and not as a horse without having seen that particular dog before: it matches the group of dogs more closely than any other mental group in our mind. It is also possible to recognize a particular dog, like your own dog, but this type of detailed recognition is only possible for a limited number of objects, since our mind is not infinite.


Cluster analysis can be seen as a basic human conceptual activity, but it is also fundamental to most branches of science. The best known and oldest example is probably the classification of organisms in biology, which is known as taxonomy. Animals and plants are hierarchically divided into different groups and subgroups. For example, man belongs to the group of the primates, the mammals, the amniotes, the vertebrates and the animals. Note how in this classification the higher the level of aggregation, the less similar the members in that class are. Man has more in common with all other primates (e.g. apes) than with more 'distant' members of the mammals (e.g. dogs). Originally, this hierarchy was not created in a purely mathematical way, but rather in an artistic way. In the recent past, the computer has made it possible to process a huge amount of information and to use mathematical methods to create such a hierarchy, and clusterings in general. As a result it is easy to search for similarities within data sets, and new classification methods have been developed in a variety of research domains.
Unlike many other statistical procedures, cluster analysis methods are mostly used when we do not have any a priori hypotheses, but are still in the exploratory phase of the research. In a sense, cluster analysis finds the 'most significant solution possible'. In this chapter, the concept of cluster analysis is explored. What is a cluster? What are some ideas to find clusters in a data set? The actual clustering algorithms based on these general ideas can be found in Chapter 3, where some of them are investigated in more detail. A simple preview of what clustering can do is depicted in Figure 2.1. In this figure, three groups of points are plotted in figure (a) and a clustering algorithm is used to find them, resulting in figure (b).

(a) Original points

(b) Clustered points

Figure 2.1: Clustering of points

2.2 Reasons for cluster analysis

Cluster analysis for understanding
Cluster analysis can be used to create conceptually meaningful groups of objects that share common characteristics. In this case a cluster scheme may represent an intelligent method for organizing a large data set so that it can be understood easily and data can be retrieved more efficiently. As stated previously, a human being uses some kind of cluster analysis to categorize the world that surrounds him and to create a comprehensible scheme of that world. Humans divide the world into groups of objects (cluster analysis) and assign particular objects to these groups (classification) when encountering a new object. Some examples:
• Biology. For many years, biologists like Aristotle, Theophrastos or Linnaeus have created taxonomies of all living things. A lot of the early work in cluster analysis was intended to classify these living things, first by hand and mostly driven by creativity, later by purely mathematical methods. Recently, biologists have been applying clustering to the large amounts of genetic information that have become available.
• Search engines. The Internet is a gigantic source of information; a query on a search engine will result in thousands of relevant web pages. Cluster analysis is used to create groups of pages that are similar with respect to contents, author, geographical location, and so on, to enhance the presented information. These groups can be divided into subgroups, creating a hierarchical structure.
• Medicine. To understand and treat a disease, it has to be classified. In general, this classification has two main aims: the first is prediction, separating diseases that require different treatments; the second is to provide a basis for research into aetiology, the causes of different types of diseases.
• Psychiatry. Diseases of the mind are more difficult to detect and differentiate than diseases of the body. There is an interest in psychiatry in using cluster analysis techniques to refine or even redefine current diagnostic categories. For example, cluster analysis has been used in research on depression, suicide and eating disorders.
• Market research. Department stores collect large amounts of data concerning the transactions of their customers. Cluster analysis can be used to identify different types of customers, and this information can be used to target the necessary marketing efforts.


• Whisky tasting. If you like a particular malt whisky, what other kinds of whisky might you also enjoy? This is obviously a search for similar objects and can therefore be done by means of clustering. It has been done in [Clustan Ltd., 2006] and [Lapointe and Legendre, 1994], to be used by distilleries to create new distillations or by consumers to try new, similar malts.

Cluster analysis for utility
Cluster analysis creates an abstract representation of the original data. Some clustering techniques characterize each cluster in terms of a cluster prototype, a data object that is representative of the other objects in the cluster. These cluster prototypes can be used as the basis for a number of data analysis or data processing techniques. Some examples:
• Summarization. Analyzing large data sets is a time-consuming operation. It would be useful to restrict this operation to a reduced data set containing only the most meaningful prototypes. Cluster analysis can be used to calculate these prototypes of a large data set. The accuracy is slightly lower than when the complete data set is used, but the results are comparable and obtained notably faster.
• Compression. Cluster prototypes can also be used for data compression. A well-known example is vector quantization, where a table is created with the prototypes and every object in the data set is linked to one of those prototypes. This kind of compression is often applied to image, sound and video data, where many of the objects are highly similar to one another, some loss of information is acceptable and a substantial reduction of the data size is desired.
• Nearest neighbors. When finding nearest neighbors, one has to compare the selected data object with all other objects in the data set. Often, clusters and their prototypes can be found more efficiently, and the distances between the selected data point and the cluster prototypes can be computed first. Objects in distant clusters can then be neglected and only objects in nearby clusters need to be compared (a small sketch of this idea follows this list).
More examples can be found in [Hartigan, 1975], [Jain and Dubes, 1988] or [Jain et al., 1999].
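To make the nearest-neighbor idea concrete, consider the following minimal Python/NumPy sketch. It is an illustration only and not part of the thesis prototype; the function name, the parameter n_clusters_to_search and the data layout are assumptions. The query point is first compared with the cluster prototypes, and the exact search is then restricted to the points of the few nearest clusters.

import numpy as np

def prototype_filtered_nn(query, points, labels, prototypes, n_clusters_to_search=2):
    # Distance from the query to every cluster prototype.
    proto_dist = np.linalg.norm(prototypes - query, axis=1)
    # Keep only the clusters whose prototypes are closest to the query.
    nearby_clusters = np.argsort(proto_dist)[:n_clusters_to_search]
    candidates = points[np.isin(labels, nearby_clusters)]
    # Exact nearest-neighbor search restricted to the candidate points.
    return candidates[np.argmin(np.linalg.norm(candidates - query, axis=1))]

The trade-off is the usual one for prototype-based summarization: the search becomes approximate, but far fewer distance computations are needed.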

2.3 What is a cluster?

The key idea of cluster analysis is grouping similar objects. But what defines this similarity? What is the definition of a cluster? [Bonner, 1964] suggested that the ultimate criterion for evaluating the meaning of a cluster is the value judgement of the user. If using a term such as cluster produces an answer of value to the investigator, that is all that is required. This is a valid argument, but not formal enough for many authors. However, as it turns out, a formal definition is not only difficult to give but may not even be appropriate. It is not entirely clear how a 'cluster' is recognized by humans, and these clusters can differ from person to person. One feature of the recognition process appears to involve an assessment of the relative distances between points, but how this relative distance should be interpreted is itself open to interpretation. Therefore, the definition of a cluster can be based upon different interpretations, as can be seen in Figure 2.2.

Figure 2.2: Clusters with internal cohesion and/or external isolation

Some types of clusters:
• Well-separated. A cluster is a set of objects in which each object is closer (or more similar) to every other object in the cluster than to any object not in the cluster. This idealistic definition is satisfied only when the data contain natural clusters that are quite far from each other. Well-separated clusters can have any shape; an example can be seen in Figure 2.3a.
• Prototype-based. A cluster is a set of objects in which each object is closer (or more similar) to the prototype that defines the cluster than to the prototype of any other cluster. For data with continuous attributes, the prototype of a cluster is often a centroid (the mean of all points in the cluster). For data with categorical attributes a centroid is not meaningful, and therefore the prototype is often a medoid (the most central point). Prototype-based clusters tend to be spherical, as the prototype is the most central point; an example can be seen in Figure 2.3b.
• Graph-based. If the data are represented as a graph, where the nodes are objects and the links represent connections among objects, then a cluster can be defined as a connected component: a group of objects that are connected to one another but have no connection to objects outside the group. An example can be seen in Figure 2.3c.
• Density-based. A cluster is a dense region of objects that is surrounded by a region of low density. A density-based definition of a cluster is often employed when the clusters are irregular or intertwined and when noise and outliers are present. A density-based cluster can take on any shape; an example can be seen in Figure 2.3d.
• Shared-property (conceptual clusters). More generally, we can define a cluster as a set of objects that share a property. This definition encompasses all the previous definitions of a cluster. The process of finding such clusters is called conceptual clustering. When conceptual clustering gets too sophisticated, it effectively becomes pattern recognition in its own right, and the definition is no longer a basic one.

The specific interpretation of clusters that a method uses to create these clusters can result in totally different mathematical approaches. It is therefore important to decide which type of clusters is needed to solve a problem.

(a) Well-separated

(b) Prototype-based

(c) Graph-based

(d) Density-based

Figure 2.3: Types of clusters

2.4 Types of cluster analysis

When the wanted type of cluster is known, a suitable method for extracting these clusters is needed. A variety of methods for finding clusters is available, each producing its own type of clusters. The way these methods work can be divided based on three characteristics. These define not the type of clusters the method produces, but the principle by which the data set is processed. The three characteristics are:
• Hierarchical versus partitional. The best known distinction between clustering techniques is whether the clustering is nested or not. A partitional clustering is a division of the data set into non-overlapping subsets (the clusters), such that each data object is in exactly one subset. This is also called iterative clustering because the clusters are rearranged until some local minimum is achieved. In hierarchical clustering, clusters can contain subclusters, forming a set of nested clusters organized as a tree. Each node in the tree is the union of its children. A hierarchical clustering can be seen as a sequence of partitional clusterings, and a partitional clustering can be obtained by cutting the hierarchical tree at a particular level.
• Exclusive versus overlapping versus fuzzy. If every data object belongs to at most one cluster, the clustering is exclusive. However, there are situations in which a data object can belong to more than one cluster. For example, a book can belong to the category fiction and to the category English literature. In that case the clustering is non-exclusive or overlapping. This concept can be taken further by using the idea of fuzzy sets. When working with fuzzy sets, an object belongs to every set, but with a membership weight between 0 and 1. Fuzzy cluster analysis means that every data object belongs to every cluster, but with a membership weight that indicates how strongly the data object belongs to that cluster. A fuzzy clustering can be converted into an exclusive clustering by assigning each data object to the cluster in which its membership weight is highest (a minimal sketch of this conversion follows this list).
• Complete versus partial. A complete clustering requires that every point is assigned to a cluster, whereas a partial clustering does not. A partial clustering can be useful when clustering a data set in which noise and outliers are present. With complete clustering, these noise points would be added to a cluster and lower its accuracy with respect to the real cluster points. For example, when searching for the centers of crowded areas, we want to ignore people who live somewhere all alone; they do not have to be 'clustered' into a big city.
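As announced above, a fuzzy clustering can be reduced to an exclusive one by taking, for every object, the cluster with the highest membership weight. A minimal Python/NumPy sketch (illustration only; the membership matrix is an assumed example, not data from this thesis):

import numpy as np

# Hypothetical fuzzy membership weights: rows are data objects, columns are clusters.
memberships = np.array([[0.8, 0.1, 0.1],
                        [0.2, 0.7, 0.1],
                        [0.3, 0.3, 0.4]])

# Exclusive clustering: assign every object to its highest-weight cluster.
exclusive_labels = memberships.argmax(axis=1)
print(exclusive_labels)  # [0 1 2]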

2.5 Proximity measure

The most important characteristic for identifying clusters in a data set is the notion of relative distance: how close or similar are data objects to each other, or how far apart or dissimilar are they? The proximity of objects can be determined either directly or indirectly. In some applications the inter-object similarity or dissimilarity can be measured directly, for example in experiments where people are asked to judge the perceived similarity or dissimilarity of a set of objects of interest, like the different tastes of whisky in the example in Section 2.2. Mostly, however, the similarity measure is derived indirectly. In that case, the proximity is derived from the difference between two measurements on the objects. For example, the distance between two cities is derived from their positions, and the difference in height between two people is the difference between their two measured heights.
When using indirect proximity measures, different types of measures are possible. When defining the proximity measure one wants to use, the type of data is important. Simply subtracting two data elements is possible when the data are continuous, but it is useless when the data are categorical, because categorical data have a limited set of possible values that are not necessarily numerical. Some proximity measures are given below; which one is best suited depends on the specific application and the type of data.

a. Continuous data
For continuous variables, proximities between individuals are typically quantified by dissimilarity or distance measures, where a dissimilarity measure δ_{ij} is termed a distance measure if it is non-negative and fulfils the metric (triangular) inequality

δ_{ij} + δ_{im} ≥ δ_{jm}    (2.1)

1. Euclidean distance. This is probably the best known type of distance: it is simply the geometric distance in the n-dimensional space. The formula is

distance(x, y) = \sqrt{\sum_i (x_i - y_i)^2}    (2.2)

2. City-block (Manhattan) distance. This distance is the sum of the absolute differences across dimensions, comparable to walking from one point to another in a block-style city like Manhattan:

distance(x, y) = \sum_i |x_i - y_i|    (2.3)


3. Chebyshev distance. This distance measure may be appropriate when one wants to define two objects as 'different' if they differ on any one of the dimensions:

distance(x, y) = \max_i |x_i - y_i|    (2.4)

4. Other distances can be found in, for example, [Everitt et al., 2001]. A small numerical sketch of these distance measures follows Figure 2.4.

(a) Euclidean distance
(b) Manhattan distance
(c) Chebyshev distance

Figure 2.4: Distance measures for continuous data
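As a small numerical sketch of the three measures above (Python/NumPy, illustration only), take x = (1, 4) and y = (4, 0):

import numpy as np

x = np.array([1.0, 4.0])
y = np.array([4.0, 0.0])

euclidean = np.sqrt(np.sum((x - y) ** 2))  # sqrt(3^2 + 4^2) = 5.0
manhattan = np.sum(np.abs(x - y))          # 3 + 4 = 7.0
chebyshev = np.max(np.abs(x - y))          # max(3, 4) = 4.0
print(euclidean, manhattan, chebyshev)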

b. Categorical data
The most common form of categorical data occurs when all the variables are binary. For this case, all the proximity measures are defined in terms of the entries in a cross-classification of the counts of matches and mismatches in the binary variables for two individuals:

                      Individual j
                       1     0
  Individual i   1     a     b
                 0     c     d

Based on this table, some distance or similarity measures can be defined:

1. Matching coefficient.

s_{ij} = (a + d)/(a + b + c + d)    (2.5)

This similarity measure also counts a zero-zero match as a positive match when comparing a bit of two data objects: the absence of a property in both objects is seen as a similarity as well.

2. Jaccard coefficient.

s_{ij} = a/(a + b + c)    (2.6)


The absence of a property is not always a similarity. If, for example, the presence or absence of a relatively rare attribute such as blood type AB negative is measured, two individuals with that blood type are clearly similar, but it is not clear whether the same can be said about two people who do not have it.
When the variables of categorical data have more than two levels, this can be dealt with in a way similar to binary data, with each level of a variable being regarded as a single binary variable. The similarity measure is then an average over the p binary comparisons:

s_{ij} = \frac{1}{p} \sum_{k=1}^{p} s_{ijk}    (2.7)

where s_{ijk} is the binary similarity measure for the k-th variable.
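The coefficients above translate directly into code once the counts a, b, c and d are computed from two binary vectors. A minimal Python/NumPy sketch (illustrative only; the two example vectors are assumptions):

import numpy as np

def binary_similarities(i, j):
    # Matching and Jaccard coefficients for two binary vectors i and j.
    i, j = np.asarray(i, bool), np.asarray(j, bool)
    a = np.sum(i & j)     # 1-1 matches
    b = np.sum(i & ~j)    # 1-0 mismatches
    c = np.sum(~i & j)    # 0-1 mismatches
    d = np.sum(~i & ~j)   # 0-0 matches
    matching = (a + d) / (a + b + c + d)
    jaccard = a / (a + b + c) if (a + b + c) > 0 else 0.0
    return matching, jaccard

print(binary_similarities([1, 0, 1, 1, 0], [1, 1, 0, 1, 0]))  # (0.6, 0.5)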

2.6 Summary

The purpose of cluster analysis techniques is to process data sets and summarize them into meaningful groups or clusters of objects that resemble each other and differ in some aspects from the objects in other clusters. Many different methods have been developed to create such a clustering from a data set, based on the general ideas given in this chapter. Each of these methods has its advantages and disadvantages. It is important to determine the kind of raw data and the purpose of the clustering in order to choose a clustering technique. Three types of clustering techniques are investigated in Chapter 3: K-means, hierarchical and density-based clustering.

Chapter 3

Clustering techniques

The Milky Way is nothing else but a mass of innumerable stars planted together in clusters.
Galileo Galilei, 1564-1642

3.1 Introduction

In the previous chapter, some general characteristics of clustering techniques were introduced. In this chapter, we look at some concrete techniques. A wide variety of cluster algorithms has been proposed in the literature, making it impossible to verify and test every method. Based on [Stein and Busch, 2005], we can differentiate three large groups of clustering techniques, as shown in Figure 3.1. From each of these groups, one method is implemented and tested; these methods are indicated in the figure by their red color. An additional group exists, the meta-search-controlled methods. These methods are not clustering techniques by definition but could be used to find clusters.

Cluster approach
    Hierarchical
        Agglomerative: single-linkage, group average
        Divisive: min-cut analysis
    Iterative
        Prototype-based: K-means, K-medoid
        Combination-based: Kernighan-Lin
    Density-based
        Point concentration: DBSCAN
        Cumulative attraction: MajorClust
    Meta-search-controlled
        Descend methods: simulated annealing
        Competitive: genetic algorithms

Figure 3.1: The classical taxonomy of cluster algorithms

The first algorithm is the K-means method [MacQueen, 1967]. It is a prototype-based, partitional, iterative clustering technique and one of the most famous clustering techniques. It is a powerful method, but the basic algorithm is quite simple, which makes it quite fast. The second algorithm is a hierarchical clustering method; our focus is on the agglomerative type. This method creates a tree-like structure indicating the proximity of points to other points or to other subclusters. The last algorithm is the DBSCAN method [Ester et al., 1996], a density-based, partial partitional clustering technique. This method is newer than the two other methods and uses a totally different technique for finding clusters, since it looks at densities and not at individual point-to-point distances.

3.2 K-means

The K-means algorithm is a clustering technique that creates a one-level partitioning of the data objects: one-level because the data is simply divided into K clusters, without subclusters. For each cluster a cluster prototype is defined: the centroid of the cluster, which is the mean of the data objects in that cluster. Related to the K-means algorithm is the K-medoid algorithm, where the cluster prototype is the medoid of the data objects in the cluster, i.e. the most representative data object belonging to that cluster. The K-means algorithm is mostly applied to data objects in a continuous n-dimensional space, where it is easy to calculate a mean value for an object attribute; a mean object will almost never correspond to an actual data point. The K-medoid, on the other hand, can be used for most types of data since it only requires a proximity measure between objects, and the cluster prototype is an actual data point. In this section the focus is on the K-means algorithm, which is one of the oldest and most widely used clustering algorithms.

3.2.1 The basic K-means algorithm

The K-means algorithm is described in Listing 3.1 and depicted in Figure 3.2. As input, the algorithm expects the number of clusters K it has to find. First, we choose K initial centroids, where K is the desired number of clusters. Each point is then assigned to the closest centroid, defining a cluster. The centroid of each cluster is then updated based on the points assigned to the cluster. This assigning and updating is repeated until no data point changes cluster anymore. K-means is an iterative algorithm, because points can be reassigned to another cluster during the process. A detailed look at each of the steps of the algorithm follows.

Listing 3.1: K-means algorithm
1  // input: the number of clusters, K
2  Select K points as initial centroids.
3  while Centroids do change
4      Form K clusters by assigning each point to its closest centroid.
5      Recompute the centroid of each cluster.
6  end


Figure 3.2: K-means algorithm with three clusters


Select K points as initial centroids (line 2)
The first thing we must do is select K points as the initial centroids. The simplest approach is to choose K random centroids. Different runs of K-means, however, then produce different results, so choosing good initial centroids is the key step of the basic K-means algorithm. A technique used to overcome this problem is to perform multiple runs of the K-means algorithm and to select the run with the best results; this technique, however, is not perfect. Other possibilities exist. We can, for example, take a sample of points and cluster them using a hierarchical clustering technique (see Section 3.3) to define the initial centroids. Another option is to select a first centroid at random and to choose each next centroid as the point that is most distant from the centroids already chosen; this guarantees a well separated set of initial centroids. More advanced alternatives exist, but they are too complex for this stage of our investigation.

Assign each point to its closest centroid and recompute the centroid (lines 4-5)
To define what the closest centroid is, a proximity measure is needed. For data points in a 2D plane, the Euclidean (L2) distance is often used. Different proximity measures exist for a variety of data types. Note that simple proximity measures are preferred, as they are used repeatedly in the algorithm to compare pairs of data points. The proximity measure also defines the direction in which the centroid is moved; for example, the centroid is moved to the center of the assigned points when using the L2 proximity measure. We can state that the proximity measure defines the objective of the clustering algorithm.

Keep updating while centroids change (line 3)
The updating step explained in the previous paragraph is repeated until no changes are made anymore. This constraint can be relaxed by defining a δ and changing the stopping criterion to 'while the change is larger than δ'. In this case the results are less accurate, depending on δ, but fewer updating cycles are needed.
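The steps above can be summarized in a short Python/NumPy sketch of Listing 3.1, with simple random initialization (illustration only; this is not the implementation used for the prototype):

import numpy as np

def kmeans(points, k, seed=0):
    rng = np.random.default_rng(seed)
    # Select K points as initial centroids (line 2).
    centroids = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    labels = None
    while True:
        # Form K clusters by assigning each point to its closest centroid (line 4).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments, and hence centroids, no longer change (line 3)
        labels = new_labels
        # Recompute the centroid of each cluster (line 5); keep the old one if a cluster is empty.
        for c in range(k):
            if np.any(labels == c):
                centroids[c] = points[labels == c].mean(axis=0)
    return labels, centroids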

3.2.2 Time and space complexity

For the K-means algorithm both the data points and the centroids are stored, which is not very much data. The space requirement is

space requirement = O((m + K) n)    (3.1)

with
m = number of data points
K = number of clusters
n = number of attributes

The time complexity of K-means is also quite small; it is basically linear in the number of data points:

time requirement = O(I · K · m · n)    (3.2)

with
I = number of iterations
K = number of clusters
m = number of data points
n = number of attributes

I is the number of iterations needed for convergence. As most changes typically occur in the first few iterations, I can be safely bounded. Therefore, K-means is basically linear in the number of data points. This makes the K-means algorithm quite efficient, even though multiple runs are often performed to tackle the initialization problem. Some variants, like bisecting K-means, are more efficient and less susceptible to initialization problems.

3.2.3 Strengths and weaknesses

Handling empty clusters
When no points are assigned to a cluster during the assignment step, an empty cluster is created. If this happens, a replacement centroid has to be found. One approach is to choose the point that is furthest away from the other centroids.

Outliers and noise
Outliers can strongly influence the K-means algorithm. A data point that is far away from every centroid still has to be assigned to a cluster, and when the centroid is recalculated from the data points in the cluster, the new centroid can be less representative for the cluster than it would have been without the outlier. It is important to identify outliers and to recognize whether they are remarkable points (e.g. an unusual customer) or whether they should be eliminated (e.g. a wrong experimental result).


Data types
K-means can be used for a wide variety of data types. However, K-means is restricted to data for which there is a notion of a center (centroid). A related technique, K-medoid clustering, does not have this restriction, but is more expensive.

Cluster shape and subclusters
K-means cannot handle non-spherical clusters or clusters of different sizes and densities. It is possible to find some subclusters when a large number of clusters is requested, but these are all marked as clusters, since the method creates a one-level partition. When dealing with clusters of very different sizes, the K-means method can fail because the method is prototype-based and distances are calculated from these prototypes. Therefore, the separation between two clusters lies in the middle between the two prototypes, not taking into account the size of the cluster surrounding each prototype.

Choosing initial centroids
The success of the K-means algorithm depends on the initial centroids. For example, centroids are more likely to redistribute themselves within a region with a high density of data points. A centroid has difficulty passing over to a distant set of clusters if at least one centroid is already in that other region, because none of the points of the distant set of clusters is then assigned to a centroid from the first region. This situation is depicted in Figure 3.3. Three clusters are present in the picture and K-means has to find these three clusters. When two initial centroids start in the large upper cluster and one in the clusters beneath, the depicted situation is the result: the two upper centroids cannot get out of the upper cluster because the distance between the two lower clusters is much smaller than the distance between the upper and lower clusters. This shows that good initial centroids are important for a successful clustering.

Figure 3.3: Result of K-means with bad initial centroids
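The farthest-point initialization mentioned in Section 3.2.1 is one simple way to avoid the situation of Figure 3.3. A minimal Python/NumPy sketch (illustrative only, not part of the thesis implementation):

import numpy as np

def farthest_first_centroids(points, k, seed=0):
    rng = np.random.default_rng(seed)
    # Start from one randomly chosen centroid.
    centroids = [points[rng.integers(len(points))]]
    while len(centroids) < k:
        # Distance of every point to its nearest already chosen centroid.
        d = np.linalg.norm(points[:, None, :] - np.asarray(centroids)[None, :, :], axis=2).min(axis=1)
        # The next centroid is the point farthest from all current centroids.
        centroids.append(points[np.argmax(d)])
    return np.asarray(centroids)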

3.3 Hierarchical clustering

Hierarchical clustering does not partition the data objects into a particular number of clusters in a single step. Instead, the clustering consists of a series of partitions, creating a hierarchy of nested subclusters. Within the hierarchical clustering methods, there are two basic approaches: agglomerative and divisive methods. Agglomerative methods start with all the n points in a data set as individual clusters and, at each step, merge the closest pair of clusters. Divisive methods, on the other hand, start with one cluster containing all the data points; at each step, a cluster is split until only n singleton clusters remain. For merging and splitting clusters, a distance measure between clusters is needed. Some measures are given in the next section, where we explain the different steps of the algorithm in detail. Agglomerative methods are more common, so the focus is on this type. A hierarchical clustering is mostly displayed graphically using a tree-like diagram called a dendrogram, like the one shown in Figure 3.4. This dendrogram displays the cluster-subcluster relationships and the order in which the clusters are merged. Such a structure resembles an evolutionary tree, and it is indeed in biological applications that hierarchical clustering finds its origin.


Figure 3.4: Dendrogram

3.3.1 Agglomerative Hierarchical Clustering Algorithm

As hierarchical clustering is mostly done in an agglomerative way, the focus will be on this algorithm, which is shown in Listing 3.2. A graphical representation of the different steps can be seen in Figure 3.5. If one-level clusters are needed, as for the K-means algorithm, they can be extracted from the dendrogram: one of the axes indicates the distance bridged to merge two clusters, and if this distance is large compared with the other distances in the dendrogram, a good splitting point has been found.

Listing 3.2: Agglomerative hierarchical algorithm
1  Assign each point to its individual cluster
2  Compute the proximity matrix
3  while Number of clusters is larger than one
4      Merge the closest two clusters
5      Update the proximity matrix to reflect the proximity between the new cluster and the original clusters
6  end

Assign each point to its individual cluster (line 1)
In the initial state of the algorithm, each of the n data points is assigned to its own individual cluster, so n clusters are present.

Compute the proximity matrix (line 2)
Based on the chosen proximity measure, a proximity matrix can be calculated. This matrix contains the distances between every pair of data objects in the data set. For example, suppose we have five data points and we use the Euclidean distance; then the following distance matrix is possible (rows and columns correspond to points 1 to 5):

D_1 = \begin{pmatrix}
 0.0 &      &      &      &     \\
 2.0 &  0.0 &      &      &     \\
 6.0 &  5.0 &  0.0 &      &     \\
10.0 &  9.0 &  4.0 &  0.0 &     \\
 9.0 &  8.0 &  5.0 &  3.0 & 0.0
\end{pmatrix}    (3.3)

Merge the closest clusters (line 4)
To merge the two closest clusters, it is necessary to define the distance between two clusters, based on the proximity matrix. Some methods are:
• Single linkage. The proximity of two clusters is defined as the minimum distance between any two points in the two different clusters.
• Complete linkage. The proximity of two clusters is defined as the maximum distance between any two points in the two different clusters.
• Average linkage. The proximity of two clusters is defined as the average pairwise proximity among all pairs of points in the different clusters.
• Ward's method. The proximity between two clusters is defined as the increase in the squared error that results when the two clusters are merged.

Figure 3.5: Hierarchical algorithm using single linkage

Assume we have selected the single linkage method and have proximity matrix D_1. In that case points 1 and 2 are the closest to each other and have to be merged. After this step, we can update the proximity matrix.

Update the proximity matrix (line 5)
In matrix D_1 the distances involving points 1 and 2 have to be recalculated. For example, the new distance between point 3 and the new cluster (12) is d_{(12)3} = min[d_{13}, d_{23}] = d_{23} = 5.0. The complete matrix becomes (rows and columns correspond to (12), 3, 4, 5):

D_2 = \begin{pmatrix}
0.0 &     &     &     \\
5.0 & 0.0 &     &     \\
9.0 & 4.0 & 0.0 &     \\
8.0 & 5.0 & 3.0 & 0.0
\end{pmatrix}    (3.4)

Until one cluster remains (line 3)
When the steps above are repeated, one cluster disappears in every iteration until just one cluster remains. This can be seen in Figure 3.5.
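The worked example with D_1 and D_2 can be reproduced with a standard implementation. The following sketch uses SciPy's hierarchical clustering on the same five-point distance matrix (an illustration only; SciPy was not used in this thesis):

import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage

# Full symmetric form of the distance matrix D_1 from Equation (3.3).
D1 = np.array([[ 0.0,  2.0,  6.0, 10.0,  9.0],
               [ 2.0,  0.0,  5.0,  9.0,  8.0],
               [ 6.0,  5.0,  0.0,  4.0,  5.0],
               [10.0,  9.0,  4.0,  0.0,  3.0],
               [ 9.0,  8.0,  5.0,  3.0,  0.0]])

# Single-linkage agglomerative clustering; each row of Z describes one merge:
# [cluster index, cluster index, merge distance, size of the new cluster].
Z = linkage(squareform(D1), method='single')
print(Z)  # the first merge joins points 1 and 2 of the example at distance 2.0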

3.3.2 Time and space complexity

This method uses a proximity matrix that has to be stored in memory, together with the clusters. This gives:

proximity matrix = n^2/2    (3.5)
clusters = n - 1 (excluding singleton clusters)    (3.6)
space complexity = O(n^2)    (3.7)

with n = number of data points.

The analysis of the basic agglomerative hierarchical clustering algorithm is also straightforward with respect to computational complexity. The first step is to compute the proximity matrix (line 2), which has a complexity of O(n^2). After that, the while-loop is executed n − 1 times because in every step two clusters are merged, until one cluster is left. Inside the while-loop there are two actions: the first one is to merge the clusters (line 4), which requires O((n − i + 1)^2) time when using a linear search, proportional to the current number of clusters squared. The other action is to update the proximity matrix, which has a complexity of O(n − i + 1). To conclude, the basic algorithm has the following time complexities:

proximity matrix = O(n^2)    (3.8)
merging and updating = O(n^3)    (3.9)

with n = number of data points.

3.3.3 Strengths and weaknesses

Merging decisions are final
Hierarchical clustering algorithms tend to make good local decisions about combining two clusters, since they can compare every pair of points in the data set. However, once a decision is made to merge two clusters, it cannot be undone at a later time. This prevents the local optimizations from becoming a global optimization. Some techniques exist for lifting this limitation, like reordering branches or using K-means to create many small clusters first, but these methods increase the complexity of the algorithm.

No final clusters
After the dendrogram has been created, the wanted number of clusters still has to be extracted. A number of techniques are possible, like dividing the set into clusters of equal size. Another possibility is to look at the distances between nodes in the dendrogram and split the tree where large distances occur. As such, a hierarchical method is rather suited for creating a taxonomy.

Outliers and noise
Because this method clusters every point, outliers and noise have to be treated specially. As in the K-means algorithm, we can delete them or realize that a lonesome point has some special value.

Cluster shape and subclusters
This method is very good at finding subclusters, as it is hierarchical in nature. The shape of the clusters depends on the chosen linkage, so the cluster shape is not restricted to spherical forms.

3.4 DBSCAN

DBSCAN is a density-based clustering algorithm, where clusters are regions of high density separated by regions of low density. The similarity measure in this type of method is based on density. More specifically, the DBSCAN method uses a center-based density approach.

3.4.1 Principle

In the center-based approach, the density at a particular point is estimated by counting all the data points within a radius ε around it. The density of each point will thus depend on the chosen radius. This method labels the data points as one of three possible classes:
• Core points. These points are in the interior of a density-based cluster. A point is a core point if the number of points within a given neighborhood around the point, as determined by ε, exceeds a certain threshold, MinPts.
• Border points. A border point is not a core point, but falls within the neighborhood of a core point. A border point can fall within the neighborhood of several core points, even from different clusters.
• Noise points. A noise point is any point that is neither a core point nor a border point.

3.4.2 DBSCAN algorithm

The algorithm is given in Listing 3.3 and the steps are shown graphically in Figure 3.6. Basically, the algorithm searches for points within a distance ε of each other. If their number exceeds MinPts, a cluster is created.

Listing 3.3: DBSCAN algorithm
while Point is unclassified
    Find points within region Eps
    if number of points within region > MinPts
        Start new cluster with Point
        Search regions of points in new cluster and expand cluster
    end
end



Figure 3.6: DBSCAN algorithm with MinPts=3
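A compact Python/NumPy sketch of the procedure in Listing 3.3, using a naive linear scan for the neighborhood search (illustration only; the prototype itself relies on an R-tree index, as described in Chapter 4):

import numpy as np

def dbscan(points, eps, min_pts):
    m = len(points)
    labels = np.full(m, -1)            # -1 means noise / not yet classified
    visited = np.zeros(m, dtype=bool)
    cluster_id = 0
    for i in range(m):
        if visited[i]:
            continue
        visited[i] = True
        neighbors = list(np.where(np.linalg.norm(points - points[i], axis=1) <= eps)[0])
        if len(neighbors) < min_pts:
            continue                   # not a core point; may still become a border point later
        # Start a new cluster with point i and expand it through its region.
        labels[i] = cluster_id
        queue = neighbors
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id
            if visited[j]:
                continue
            visited[j] = True
            j_neighbors = np.where(np.linalg.norm(points - points[j], axis=1) <= eps)[0]
            if len(j_neighbors) >= min_pts:   # j is a core point, so its region is searched too
                queue.extend(j_neighbors)
        cluster_id += 1
    return labels

Because every point triggers a neighborhood scan over all m points, this naive version runs in O(m^2), which is exactly the behavior the R-tree index is meant to improve.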

3.4.3 Time and space complexity

The space requirement of DBSCAN is O(m): only a small amount of data has to be stored for each point, namely the cluster number and perhaps the classification as core, border or noise point. The time complexity is O(m × time to find the points in the ε-neighborhood), where m is the number of data points in the data set. When a linear search is performed to find the points in the neighborhood, the complexity is O(m^2) in the worst case. When an index structure like the R-tree is used, the neighborhood search can be performed more efficiently and the complexity becomes O(m log m).

3.4.4 Strengths and weaknesses

Parameters
DBSCAN uses two user-defined parameters: Eps and MinPts. These parameters have a great influence on the performance of the algorithm. When Eps is chosen too large, the whole data set will be one large cluster; when Eps is chosen too small, every point will belong to its own singleton cluster. Some methods exist to define good values for Eps and MinPts and can be found in [Ester et al., 1996]. These two parameters are global parameters, which may be a problem. A variation on DBSCAN where the density is cluster dependent is MajorClust [Stein and Busch, 2005].

Clusters of varying density
When the data set contains clusters of varying density, the DBSCAN algorithm can have difficulties identifying the clusters. This is because of the two global parameters: every region whose density exceeds the threshold implied by Eps and MinPts becomes a cluster, and every region with a lower density does not. Extended versions of DBSCAN exist that can distinguish regions of different density; however, an extra iteration is necessary, which demands a longer execution time.

Noise resistance
Because of the density-based criterion, the method is relatively resistant to noise, as such points will be classified as noise points.

Cluster shape
The clusters can be of any shape and size within the same execution of the algorithm.


High dimensionality
In high dimensions (more than about 10 to 20), the underlying R-tree data structure degenerates to a linear search, which affects both runtime and classification performance.

3.5 Comparison

The stated problem in this thesis is the visualization of points on a 2D map. Basically, we are searching for points whose markers overlap, because then it is difficult for the user to click on the right marker. In theoretical terms, we are searching for regions with a high density of data points. This comparison of methods is an important part of this thesis, as we are searching for the best method to implement in our prototype. Most strengths and weaknesses have been given in the respective sections; in this section, some extra pointers are given with respect to our goal: creating a clutter-free prototype. Most important is the type of clusters we need to find. Neither the shape of the clusters nor their number is known in advance; the only thing known is the distance between markers at which they start to overlap. Another important criterion is the speed of the method. Ideally, the clustering can be performed instantly, so that the program can be dynamic.

3.5.1 K-means

In Figure 3.7 the K-means clustering algorithm has been used to find clusters of points on the map. The basic algorithm has one user-defined parameter: the number of clusters. Four different cases have been tested: 2, 3, 4 and 5 clusters. An extra iteration on top of the K-means algorithm can be used to define an ideal number of clusters for the map [Hamerly and Elkan, 2003]. K-means tries to divide the data points into clusters without taking into account the inter-point distances in a cluster; these distances depend on the number of clusters and the overall distances. Take, for example, the case with two clusters. The left cluster clearly consists of two easily distinguishable parts, but the distance between them is small enough to form one cluster because of the presence of the right cluster. The actual distance between points in a cluster is not easy to control for the K-means method, and that is exactly what we are looking for. Another problem is the restriction to spherical clusters, since our clusters can have any shape. Some techniques exist to loosen the form constraint, but there is always some type of form predefined. A last problem are the lonely points: a preprocessing step is needed to filter out outliers, because lonely points do not have to be clustered; they do not overlap with any other point and so are easy to click on. The main advantage of this method is its speed: K-means is the fastest method tested.


(a) 2 clusters

(b) 3 clusters

(c) 4 clusters

(d) 5 clusters

Figure 3.7: Clusters from K-means

3.5.2 Hierarchical clustering

The results for the hierarchical method are shown in Figure 3.8. The problems of this method are similar to those of the K-means method, but on top of that it is very slow because of the tree structure that is built. This tree structure has no advantage when solving the problem of overlapping markers and is a serious overhead.

(a) 2 clusters

(b) 3 clusters

(c) 4 clusters

(d) 5 clusters

Figure 3.8: Clusters from hierarchical clustering

3.5.3 DBSCAN

The results of the DBSCAN algorithm are shown in Figure 3.9. This method uses a user-defined distance between points to create clusters. As can be seen in the figure, this is comparable with what a human would do: markers that overlap are spotted and clustered. When the distance Eps is widened, as in (a), the markers do not have to overlap but simply have to be in each other's neighborhood.


This method is slower than the K-means method, mainly because of the region searches, but the resulting clusters are of good quality. DBSCAN provides high flexibility with respect to cluster shapes, the number of clusters and the problem of noise detection. Therefore, the DBSCAN method is the most suited for solving our clustering problem.

(a) ε = 1/20 of the image
(b) ε = 1/30 of the image
(c) ε = 1/40 of the image
(d) ε = 1/80 of the image

Figure 3.9: Clusters from DBSCAN

3.6 Summary

In this chapter the three main methods known in cluster analysis have been evaluated: K-means, hierarchical clustering and DBSCAN. Each method has been analyzed for its advantages and disadvantages. The DBSCAN algorithm is the best method to solve our problem of cluttered markers on a map, mainly because of its density-based approach. In the next chapter, the exact implementation of the DBSCAN method to solve the map cluttering problem is described in detail.

Chapter 4

DBSCAN and R-tree

4.1 Introduction

Based on the analysis in the previous chapter, the DBSCAN algorithm is the most suited for our needs. This density-based clustering algorithm can be used to find crowded areas on a map, without taking outliers into account or needing a predefined number of clusters. The maximum distance between points in a high-density region is a user-defined parameter. A closer look at the theory and the implementation of this algorithm is taken in this chapter. The most time-consuming part of the DBSCAN algorithm is the spatial search for points in the neighborhood; to speed this up, an index structure, the R-tree, is investigated and implemented. Both the DBSCAN algorithm and the R-tree have been implemented by the author in Matlab [The Mathworks, Inc., 2006] for testing purposes.

4.2 DBSCAN algorithm

The DBSCAN algorithm was first proposed in [Ester et al., 1996]. The authors developed a clustering algorithm for class identification in large spatial databases, which leads to the following requirements for the algorithm:
• Minimal requirements of domain knowledge to determine the input parameters, because appropriate values are often not known in advance when dealing with large databases.
• Discovery of clusters with arbitrary shape, because the shape of clusters in spatial databases may be spherical, drawn-out, linear, elongated, etc.
• Good efficiency on large databases, i.e. on databases of significantly more than just a few thousand objects.


These are also the requirements for our prototype, and therefore the DBSCAN method is used as the clustering algorithm in the prototype.

4.2.1 Density-based notion of clusters

We can easily identify the clusters in Figure 4.1 based on the density of points, which is considerably higher within each cluster than outside it. In the following, we try to formalize this intuitive notion of 'clusters' and 'noise' in a database D of points in some k-dimensional space S.

Figure 4.1: Sample clusters

Definition 1 (ε-neighborhood of a point) The ε-neighborhood of a point p, denoted by Nε(p), is defined by Nε(p) = {q ∈ D | dist(p, q) ≤ ε}, where dist(p, q) is the distance function for two points p and q.

We cannot demand for every point in a cluster that there are at least a minimum number (MinPts) of points in an ε-neighborhood of that point. This is because there are two kinds of points in a cluster: points inside the cluster (core points) and points on the border of the cluster (border points). In general, an ε-neighborhood of a border point contains significantly fewer points than an ε-neighborhood of a core point. A necessary condition for a border point is therefore the presence of a core point: a border point only belongs to a cluster when it is directly density-reachable from a core point.

Definition 2 (directly density-reachable) A point p is directly density-reachable from a point q wrt. ε and MinPts if 1) p ∈ Nε(q) and 2) |Nε(q)| ≥ MinPts (core point condition). Here, |X| denotes the number of elements in X. Direct density-reachability is shown in Figure 4.2(b).

An extension of directly density-reachable is density-reachable. Density-reachable points do not have to be directly connected but can be connected through a chain of connections, as shown in Figure 4.2(c).

Definition 3 (density-reachable) A point p is density-reachable from a point q wrt. ε and MinPts if there is a chain of points p1, ..., pn, p1 = q, pn = p such that pi+1 is directly density-reachable from pi.
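As an illustration of Definitions 1 and 2, the following minimal Python sketch (not the author's Matlab code; the function names are chosen for illustration only) computes the ε-neighborhood of a point by a linear scan and checks the core point condition.

from math import dist  # Euclidean distance, Python 3.8+

def eps_neighborhood(D, p, eps):
    """N_eps(p): all points of the database D within distance eps of p
    (including p itself), found by a linear scan."""
    return [q for q in D if dist(p, q) <= eps]

def is_core_point(D, p, eps, min_pts):
    """Core point condition of Definition 2: |N_eps(p)| >= MinPts."""
    return len(eps_neighborhood(D, p, eps)) >= min_pts

def directly_density_reachable(D, q, p, eps, min_pts):
    """q is directly density-reachable from p iff q lies in N_eps(p)
    and p satisfies the core point condition."""
    return q in eps_neighborhood(D, p, eps) and is_core_point(D, p, eps, min_pts)

D = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (2.0, 2.0)]
print(is_core_point(D, (0.0, 0.0), eps=0.2, min_pts=3))                   # True
print(directly_density_reachable(D, (2.0, 2.0), (0.0, 0.0), 0.2, 3))      # False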


Figure 4.2: Density-reachability. (a) p is a border point, q is a core point; (b) p is directly density-reachable from q, but q is not directly density-reachable from p; (c) p is density-reachable from q, but q is not density-reachable from p; (d) p and q are density-connected to each other by o.


Two border points of the same cluster C are possibly not density-reachable from each other, because the core point condition might not hold for both of them. Therefore, the notion of density-connectivity is introduced, which covers this relation between border points and is shown in Figure 4.2(d).

Definition 4 (density-connected) A point p is density-connected to a point q wrt. ε and MinPts if there is a point o such that both p and q are density-reachable from o wrt. ε and MinPts.

Now, based on these definitions, it is possible to define a density-based notion of a cluster.

Definition 5 (cluster) Let D be a database of points. A cluster C wrt. ε and MinPts is a non-empty subset of D satisfying the following conditions: 1) ∀ p, q: if p ∈ C and q is density-reachable from p wrt. ε and MinPts, then q ∈ C. (Maximality) 2) ∀ p, q ∈ C: p is density-connected to q wrt. ε and MinPts. (Connectivity)

Points not satisfying the cluster criteria do not belong to any cluster and are noise points.

Definition 6 (noise) Let C1, ..., Ck be the clusters of the database D wrt. parameters εi and MinPtsi, i = 1, ..., k. Then we define the noise as the set of points in the database D not belonging to any cluster Ci, i.e. noise = {p ∈ D | ∀i : p ∉ Ci}.

The following lemmata are important for validating the correctness of the DBSCAN clustering algorithm.

Lemma 1 Let p be a point in D with |Nε(p)| ≥ MinPts. Then the set O = {o | o ∈ D and o is density-reachable from p wrt. ε and MinPts} is a cluster wrt. ε and MinPts.

Basically, this lemma states that we can discover a cluster in a two-step approach. First, choose an arbitrary point from the database satisfying the core point condition as a seed. Second, retrieve all points that are density-reachable from the seed, obtaining the cluster containing the seed. Moreover, a cluster C contains exactly the points which are density-reachable from an arbitrary core point of C. As a consequence, a cluster is uniquely determined by any of its core points, and the algorithm is stable regardless of which point is chosen as the starting point.

Lemma 2 Let C be a cluster wrt. ε and MinPts and let p be any point in C with |Nε(p)| ≥ MinPts. Then C equals the set O = {o | o is density-reachable from p wrt. ε and MinPts}.

4.2.2 The algorithm

To find a cluster, DBSCAN starts with an arbitrary point p and retrieves all points density-reachable from p wrt. ε and MinPts. If p is a core point, this procedure yields a cluster (Lemma 2). If p is a border point, no points are density-reachable from p and DBSCAN visits the next point of the database.

This procedure is visualized in Figure 4.3. In the beginning, the first point of the database is selected and the points in its neighborhood are retrieved. If the number of points is at least MinPts, the point is a core point and a new cluster is started with this first point as seed. The points in the neighborhood are added to the queue and investigated in turn. If a point in the queue does not satisfy the MinPts criterion it is considered a border point, otherwise it is a core point. When the queue is empty the cluster is complete and the next point in the database is taken. If that point already belongs to a cluster, it is skipped. If it satisfies the MinPts criterion a new cluster is started, and if it does not, the point is marked as a noise point.

A basic version of the DBSCAN algorithm is shown in Listings 4.1 and 4.2.

Listing 4.1: DBSCAN algorithm

function DBSCAN(D, ε, MinPts)
    // Output: a labeling function γ : D → N
    ClusterLabel = 1
    for all p ∈ D
        γ(p) = 'UNCLASSIFIED'
    end
    for all p ∈ D
        if γ(p) == 'UNCLASSIFIED'
            if ExpandCluster(D, p, ClusterLabel, ε, MinPts) == true
                ClusterLabel++
            end
        end
    end
end


Figure 4.3: DBSCAN algorithm with MinPts=3. Each step in the figure uses the same legend: point in queue, core point, border point, noise point.


Listing 4.2: ExpandCluster algorithm

function ExpandCluster(D, CurrentPoint, ClusterLabel, ε, MinPts)
    // Output: true or false
    Seeds = regionQuery(D, CurrentPoint, ε)
    if count(Seeds) < MinPts
        // CurrentPoint is not a core point
        γ(CurrentPoint) = 'NOISE'
        return false
    end
    // all points in Seeds are density-reachable from CurrentPoint
    for all s ∈ Seeds
        γ(s) = ClusterLabel
    end
    Seeds = Seeds \ {CurrentPoint}
    while count(Seeds) > 0
        P = Seeds.first()
        Result = regionQuery(D, P, ε)
        if count(Result) ≥ MinPts
            for all ResPoint ∈ Result
                if γ(ResPoint) ∈ {'UNCLASSIFIED', 'NOISE'}
                    if γ(ResPoint) == 'UNCLASSIFIED'
                        Seeds = Seeds ∪ {ResPoint}
                    end
                    γ(ResPoint) = ClusterLabel
                end
            end
        end
        Seeds = Seeds \ {P}
    end
    return true
end

The call to the function regionQuery returns the ε-neighborhood of a point. This query can be supported efficiently by a spatial access method such as an R-tree; this type of index structure is explained in the next section.

The points which have been marked γ(p) = 'NOISE' may be relabeled later, if they turn out to be density-reachable from some other point in the database. This happens for border points of a cluster. Such points are not added to the seeds list, because we know that a point labeled 'NOISE' is not a core point; adding it to the seed list would only result in an additional region query which would yield no new answer.

If two clusters C1 and C2 are very close to each other, it might happen that some point p belongs to both C1 and C2. Then p must be a border point in both clusters, and it will be assigned to the cluster discovered first. Apart from these rare situations, the result of the algorithm is independent of the order in which the points are visited, due to Lemma 2.
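For concreteness, the following is a minimal, runnable Python sketch of the pseudocode in Listings 4.1 and 4.2. It is not the author's Matlab implementation; the region query here is a naive linear scan, which the R-tree of the next section is meant to replace, and all names are illustrative.

from math import dist

UNCLASSIFIED, NOISE = -2, -1

def region_query(D, p_idx, eps):
    # Naive eps-neighborhood by linear scan; a spatial index such as the
    # R-tree of Section 4.3 could answer the same query much faster.
    return [i for i, q in enumerate(D) if dist(D[p_idx], q) <= eps]

def expand_cluster(D, labels, p_idx, cluster_label, eps, min_pts):
    seeds = region_query(D, p_idx, eps)
    if len(seeds) < min_pts:              # p is not a core point
        labels[p_idx] = NOISE
        return False
    for i in seeds:                       # all seeds are density-reachable from p
        labels[i] = cluster_label
    seeds = [i for i in seeds if i != p_idx]
    while seeds:
        current = seeds.pop(0)
        result = region_query(D, current, eps)
        if len(result) >= min_pts:        # current is a core point
            for i in result:
                if labels[i] in (UNCLASSIFIED, NOISE):
                    if labels[i] == UNCLASSIFIED:
                        seeds.append(i)
                    labels[i] = cluster_label
    return True

def dbscan(D, eps, min_pts):
    labels = [UNCLASSIFIED] * len(D)
    cluster_label = 0
    for p_idx in range(len(D)):
        if labels[p_idx] == UNCLASSIFIED:
            if expand_cluster(D, labels, p_idx, cluster_label, eps, min_pts):
                cluster_label += 1
    return labels

points = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (5, 5.1), (10, 10)]
print(dbscan(points, eps=0.3, min_pts=3))  # [0, 0, 0, 1, 1, 1, -1]: two clusters, one noise point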

4.3 R-tree index structure

As an extension of the ‘ubiquitous’ B-tree, Antonin Guttman proposed a method [Guttman, 1984] for indexing multi-dimensional spatial data: the R-tree. This method and its variations are still used today for all kinds of spatial data, for example in CAD, multimedia and geographical databases. For more information about R-tree variations, see [Manolopoulos et al., 2003]. An R-tree is a height-balanced tree with index records in its leaf nodes containing pointers to data objects. The structure is designed so that a spatial search requires visiting only a small number of nodes. The index is completely dynamic; inserts and deletes can be intermixed with searches and no periodic reorganization is required. A schematic view is given in Figure 4.4.

4.3.1 Tree structure

A spatial database consists of a collection of tuples representing spatial objects. Leaf nodes in an R-tree contain a pointer Pi to such a database tuple and a rectangle I bounding the object area. This is depicted in Figure 4.4. Non-leaf nodes contain pointers to child nodes and a rectangle which is the bounding box of its children. Let M be the maximum number of entries a node can contain and m ≤ M/2 the minimum. An R-tree satisfies the following properties:

• Every node contains between m and M pointers Pi unless it is the root.

• Each leaf node contains a rectangle I which is the smallest rectangle that spatially contains the n-dimensional data object represented by that tuple.

• Each non-leaf node contains a rectangle I which is the smallest rectangle that spatially contains the rectangles in its child nodes.

• The root node has at least two children unless it is a leaf.

• All leaves appear on the same level.

The height of an R-tree containing N database object tuples is at most ⌈log_m N⌉ - 1. The time needed for querying the index depends on m. A large m will reduce the height of the tree and improve space utilization. A too large m, however, will reduce the tree to a linear index, which is what we had before indexing. A small m, on the other hand, results in a higher tree, partitioning the objects more. When m is chosen too small, too many nodes have to be examined when searching the tree, which increases the time needed.

Figure 4.4: Schematic view of the R-tree structure, with (a) the schematic (tree) view, showing the nodes on levels 0, 1 and 2 pointing down to the data tuples from the database, and (b) the spatial view, showing the nested bounding rectangles drawn around the shapes of the data objects


A trade-off between a wide tree and a high tree has to be made. Therefore, the parameter m can be tuned as part of performance tuning.
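As a quick illustration of this height bound (with arbitrary example numbers that are not taken from the thesis), the snippet below evaluates ⌈log_m N⌉ - 1 for a few choices of m:

from math import ceil, log

def max_rtree_height(N: int, m: int) -> int:
    # Upper bound on the height of an R-tree holding N object tuples when
    # every node (except the root) contains at least m entries.
    return ceil(log(N, m)) - 1

for m in (2, 10, 25, 50):
    print(m, max_rtree_height(10_000, m))
# m = 2 gives a bound of 13, while m = 25 already brings it down to 2,
# which is why a larger minimum fill factor yields a flatter (wider) tree.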

4.3.2 Searching the tree

The search algorithm descends the tree from the root to the leaf nodes, visiting only those nodes whose rectangles overlap the search region. The algorithm is given in Listing 4.3. It is initially called with the root node as T and the search area as S.

Listing 4.3: R-tree search algorithm

// input: start node T, search area S
if T is not a leaf node
    for all entries E in T
        if E.I overlaps with S
            invoke Search on the tree whose root node is pointed to by E.p
        end
    end
else // T is a leaf node
    for all entries E in T
        if E.I overlaps with S
            E is a qualifying record
        end
    end
end
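The following is a minimal Python sketch of this search, using a simplified node structure in which every entry is modelled as a node with its own bounding rectangle; the class and field names are illustrative and do not come from the author's Matlab implementation. It shows the two ingredients of Listing 4.3: the rectangle-overlap test and the recursive descent.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Rect:
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def overlaps(self, other: "Rect") -> bool:
        # Two rectangles overlap unless one lies entirely to the left/right of,
        # or entirely above/below, the other.
        return not (self.xmax < other.xmin or other.xmax < self.xmin or
                    self.ymax < other.ymin or other.ymax < self.ymin)

@dataclass
class Node:
    rect: Rect                                             # bounding rectangle of this entry
    children: List["Node"] = field(default_factory=list)   # empty for leaf entries
    record: Optional[object] = None                        # data tuple, set only on leaf entries

def search(node: Node, area: Rect, hits: List[object]) -> None:
    # Descend only into entries whose rectangle overlaps the search area,
    # as in Listing 4.3.
    for entry in node.children:
        if entry.rect.overlaps(area):
            if entry.record is not None:    # leaf entry: qualifying record
                hits.append(entry.record)
            else:                           # non-leaf entry: search its subtree
                search(entry, area, hits)

# Tiny hypothetical tree with two leaf entries.
leaf_a = Node(Rect(0, 0, 1, 1), record="A")
leaf_b = Node(Rect(5, 5, 6, 6), record="B")
root = Node(Rect(0, 0, 6, 6), children=[leaf_a, leaf_b])
found: List[object] = []
search(root, Rect(0.5, 0.5, 2, 2), found)
print(found)  # ['A']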

4.3.3 Add a point to the tree

When adding a point to the tree, first the best leaf node is selected and then the point is added to it. If the updated leaf node contains more than M pointers, it overflows and must be split. This splitting of nodes propagates upwards as long as the updated nodes contain more than M pointers. The algorithm is shown in Listing 4.4.

Listing 4.4: R-tree insert algorithm

// input: new record E
Select a leaf node L in which to place the new record E
if L has room for another pointer
    Install E
else
    Split the node to obtain L and LL containing E and all the old pointers of L
    Propagate the split up in the tree, eventually growing the tree by splitting the root node
end

The most important step in Listing 4.4 is the node splitting. The division should be done in a way that makes it as unlikely as possible that both new nodes will need to be examined on subsequent searches. Since the decision whether to visit a node depends on whether its covering rectangle overlaps the search area, the total area of the two covering rectangles after a split should be minimized. The most straightforward method is to try every possible division and calculate the area, but this has exponential cost. Faster methods, which trade some accuracy for speed, have been proposed: an algorithm with quadratic cost and one with linear cost. Both have been implemented in our Matlab version of the R-tree. A brief explanation of both follows; a small code sketch of the linear variant is given after this list.

• Quadratic cost. Choose two objects as seeds for the two nodes, such that these objects, if put together, create as much dead space as possible (dead space is the space that remains of the surrounding rectangle if the areas of the two objects are ignored). Then, until there are no remaining objects, choose for insertion the object for which the difference in dead space when assigned to each of the two nodes is maximized, and insert it in the node that requires the smaller enlargement of its surrounding rectangle.

• Linear cost. Choose two objects as seeds for the two nodes, such that these objects are as far apart as possible. Then, consider each remaining object in a random order and assign it to the node requiring the smaller enlargement of its surrounding rectangle.

The linear-cost node split algorithm was found to be as good as the more expensive techniques [Guttman, 1984]. It is fast, and the slightly worse quality of the splits does not noticeably affect search performance.
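A minimal sketch of the linear-cost split, simplified to point entries and with illustrative names (so not a faithful copy of Guttman's PickSeeds, which chooses the seeds with a linear normalized-separation heuristic instead of an exact farthest-pair search), is given below. The quadratic-cost variant differs only in how the seeds and the insertion order are chosen.

from math import dist
from typing import List, Tuple

Point = Tuple[float, float]

def bbox_area(points: List[Point]) -> float:
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (max(xs) - min(xs)) * (max(ys) - min(ys))

def enlargement(group: List[Point], p: Point) -> float:
    # Growth of the group's bounding rectangle when p is added to it.
    return bbox_area(group + [p]) - bbox_area(group)

def linear_split(entries: List[Point]) -> Tuple[List[Point], List[Point]]:
    # Seed the two new nodes with the pair of entries that lie farthest apart
    # (an exact O(n^2) search here, for brevity), then assign every remaining
    # entry to the node whose bounding rectangle grows the least.
    seed_a, seed_b = max(
        ((a, b) for i, a in enumerate(entries) for b in entries[i + 1:]),
        key=lambda pair: dist(pair[0], pair[1]))
    group_a, group_b = [seed_a], [seed_b]
    for e in entries:
        if e == seed_a or e == seed_b:      # assumes distinct points
            continue
        if enlargement(group_a, e) <= enlargement(group_b, e):
            group_a.append(e)
        else:
            group_b.append(e)
    return group_a, group_b

points = [(0, 0), (0.2, 0.1), (0.1, 0.3), (5, 5), (5.1, 4.9)]
print(linear_split(points))
# Splits the overflowing node into a group near the origin and a group near (5, 5).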

4.4 Summary

In this chapter we have looked at our implementation of the DBSCAN algorithm. To enhance the speed of this clustering algorithm, an index structure, the R-tree, has been examined and implemented. The R-tree is useful for indexing spatial data objects that have non-zero size, and it reduces the run time of the clustering from O(n²) to O(n log n). These techniques can now be used as the underlying architecture for our prototype. The actual development of this prototype is described in the next chapter.

Chapter 5

Web-based clustering application

5.1 Introduction

The main purpose of this thesis is to avoid cluttered markers on a map. For this, we have implemented the DBSCAN clustering algorithm to locate regions with a high concentration of markers (high density). Eventually, these clusters have to be linked to a visual interface. This chapter gives a detailed look at the architecture of the prototype.

How is the architecture of the prototype organized? Three distinct levels can be distinguished in the created program: the interface level, the algorithmic level and the index level. The interface level contains the world map with the markers and clusters shown to the user. The points are loaded from an XML file containing the individual points and the clusters. The creation of this file is done in the algorithmic level of the prototype; essentially, this level is the DBSCAN algorithm clustering the points. It has been shown that the bottleneck of the DBSCAN algorithm is searching for points in a point's neighborhood. This search can be sped up by using an index on the data points when querying the database, which is done in the index level of the architecture. An overview of the different levels is shown in Figure 5.1.
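The exact XML schema used by the prototype is not reproduced here, so the element and attribute names in the following sketch are purely illustrative. It only shows how the interface level could load such a file of points and clusters, here with Python's standard xml.etree module.

import xml.etree.ElementTree as ET

# Hypothetical XML produced by the algorithmic level; the real schema used
# by the prototype may differ.
EXAMPLE_XML = """
<map>
  <cluster id="1">
    <point lat="50.88" lon="4.70"/>
    <point lat="50.87" lon="4.71"/>
  </cluster>
  <noise>
    <point lat="51.21" lon="3.22"/>
  </noise>
</map>
"""

def load_points(xml_text: str):
    """Return (clusters, noise): cluster id mapped to its (lat, lon) points,
    plus the list of unclustered points."""
    root = ET.fromstring(xml_text)
    clusters = {}
    for cluster in root.findall("cluster"):
        cid = cluster.get("id")
        clusters[cid] = [(float(p.get("lat")), float(p.get("lon")))
                         for p in cluster.findall("point")]
    noise = [(float(p.get("lat")), float(p.get("lon")))
             for p in root.findall("./noise/point")]
    return clusters, noise

clusters, noise = load_points(EXAMPLE_XML)
print(clusters)  # {'1': [(50.88, 4.7), (50.87, 4.71)]}
print(noise)     # [(51.21, 3.22)]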

5.2 Interface level

Suppose a set of points is given with, for each point, its position and the dense region it belongs to (if it is located in a dense region). A naive approach is to plot all the markers and color them according to the cluster they belong to. This is illustrated in Figure 5.2. However, this does not solve the problem of cluttered markers; it merely indicates that there are points whose markers overlap. This visualization problem has to be solved in

(Figure 5.1 shows the three levels of the architecture: the interface level (MapView.html), the algorithmic level (DBSCAN) and the index level.)