New algorithms for pattern-discovery and pattern-matching in multidimensional datasets

David Meredith∗, Geraint A. Wiggins∗, Kjell Lemström∗†

∗ City University, School of Informatics, Department of Computing, Northampton Square, London, EC1V 0HB, United Kingdom.

† Department of Computer Science, FIN-00014 University of Helsinki, Finland.

{dave,geraint,kjell}@soi.city.ac.uk

April 1, 2001


Contents

1 Introduction
2 What the algorithms do
  2.1 SIATEC
    2.1.1 Introduction
    2.1.2 Some preliminary concepts
    2.1.3 Step 1: Computation of inter-datapoint vectors
    2.1.4 Step 2: Computation of largest pattern for each vector
    2.1.5 Step 3: Computation of TECs
    2.1.6 Overall effect of SIATEC
  2.2 SIA(M)E
    2.2.1 Introduction
    2.2.2 Step 1: Computation of inter-datapoint vectors
    2.2.3 Step 2: Computation of largest pattern for each vector
    2.2.4 Step 3: Sorting by size
    2.2.5 Overall effect of SIA(M)E
3 Summary

1 Introduction

We have developed new algorithms for efficient and flexible pattern-matching and pattern-discovery in multidimensional datasets. (A multidimensional dataset is simply any set of points in an N-dimensional space.) These algorithms could be used as the basis of new commercially exploitable applications for data compression, information retrieval and data mining or structural analysis of data. The new algorithms are particularly appropriate for use with databases in which each item in the database is represented as a multidimensional dataset, as is the case, for example, in computer-based music libraries and databases of audio data, databases of 2- and 3-dimensional molecular structures, computer-based image and video libraries and collections of graphs representing scientific results or financial data.

Algorithms already exist for data compression, information retrieval and structural analysis of data in domains such as the ones mentioned in the previous paragraph. However, most existing approaches in music information retrieval and bioinformatics and many algorithms in image-processing are based on string matching techniques that require the datasets to be represented as strings of characters before they are processed. In other words, most existing approaches attempt to process multidimensional numerical data using techniques originally designed for processing one-dimensional textual data. String-based approaches to processing multidimensional datasets are artificially limited as to the types of patterns that can be discovered and searched for; and certain information-retrieval tasks (such as, for example, searching for a polyphonic music query in a database of polyphonic music) are unnecessarily awkward to accomplish using these techniques.¹

The pattern-matching and pattern-discovery algorithms that we have developed are radically different from other existing approaches in at least two ways.
First, whereas other existing approaches are, for the most part, string-based, our approach is essentially geometrical. In our algorithms, the properties of multidimensional datasets are expressed naturally in geometrical terms using concepts such as vectors, points and geometrical transformations like translation. In other words, our algorithms process the multidimensional datasets directly using the mathematical concepts and theory that were originally developed for manipulating this kind of data.

Second, most existing approaches to pattern-discovery and pattern-matching employ techniques based on the idea of trying to align a query pattern (e.g. a user-supplied regular expression) against the dataset at each possible position. We eschew alignment-based techniques in favour of a new and revolutionary data-driven approach based on the simple fact that if there exists a pattern P in a dataset that is translationally invariant to a query pattern Q, then there will exist at least one query pattern datapoint q and one dataset point p such that the vector that maps q onto p is equal to the vector that maps Q onto P. The first step in both our pattern-matching and pattern-discovery algorithms is therefore to compute all the necessary inter-datapoint vectors (see sections 2.1 and 2.2 below).

¹ For an overview of string-matching techniques in general, see Crochemore and Rytter (1994). For an overview of string-matching techniques applied to music information retrieval, see Crawford et al. (1998) and Lemström (2000). For an introduction to pattern-matching techniques in bioinformatics, see Gusfield (1997).

Figure 1: A simple-minded alignment approach to information retrieval. Panel (i) shows the query pattern (points α, β and γ), panel (ii) shows the dataset (points a to e), and panels (iii) to (vii) show the attempts to align α with each dataset point in turn.

The advantages of our geometrical approach over the naïve alignment methods can be illustrated using a simple example. Imagine that we are trying to find all the occurrences of the query pattern in Figure 1(i) in the dataset in Figure 1(ii). We first align the query pattern point α with the dataset point a (Figure 1(iii)) and find that two of the query pattern points can be aligned but the third cannot. There is therefore no match here. So we see what happens when we align α with b (Figure 1(iv)): again, no match. We continue until we've attempted to align α with all the points in the dataset (Figure 1(v), (vi) and (vii)) and discover that there is only one position at which the query pattern can be successfully aligned (Figure 1(v)).

If the dataset is represented as a binary search tree then this process can be carried out in a worst-case running time of $O(mn \log_2 n)$, where m is the number of points in the query pattern and n is the number of points in the dataset. This time complexity arises from the fact that α has to be aligned with each of the n points in the dataset. For each of these alignments, we have to check for each of the m query pattern points whether or not it can be aligned with a dataset point. Each of these checks involves determining whether or not a datapoint is a member of the dataset which, if the dataset is represented as a binary search tree, can be achieved in $O(\log_2 n)$ time, giving an overall worst-case time complexity of $O(mn \log_2 n)$. This means that if we carry out the process with a query pattern containing $c \times m$ points, it will take c times as long as it does for a pattern containing m points. But the only thing that this process tells us is where the complete query pattern occurs within the dataset.

Figure 2: A query pattern (points α, β, γ and δ plotted against x and y axes).

Figure 3: A dataset (points a to f plotted against x and y axes).

Now let us imagine that we wish to know considerably more about the relationship between the query pattern and the dataset than can be discovered using a procedure like the one described in the previous paragraph. Let's assume we have the query pattern shown in Figure 2 and the dataset shown in Figure 3 and we wish to know the best matches or alignments that can be achieved for this query pattern and dataset. If the reader examines Figures 2 and 3, he or she will discover that, in fact, there is no complete match for the query pattern in the dataset but that the set of query pattern points {α, β, γ} can be matched against the dataset patterns {a, b, d} and {c, d, f} and the set of query pattern points {β, γ, δ} can be matched against the dataset pattern {a, c, d}. These three partial matches are the "best" alignments that can be achieved between the query pattern and the dataset in this case.

If we tried to discover this information using a procedure like the one described in the previous paragraph, it would not be sufficient to do as we did in Figure 1 and test only those alignments in which α is aligned with a dataset point. We would also have to test all those cases where β is aligned with a dataset point and all those cases where γ is aligned and so on. This gives an overall worst-case running time for this approach of $O(nm^2 \log_2 n)$ if the dataset is stored as a binary search tree. In other words, finding the "best" matches using this procedure for a query pattern containing $c \times m$ points takes $c^2$ times as long as finding them for a query pattern containing m points. However, by using our new, geometrical, vector-computation approach combined with some techniques for efficient data storage, we can accomplish the same task in a worst-case running time of $O(mn)$, that is, a time that for a given dataset increases linearly with the size of the query pattern.
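The contrast between the two strategies can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, a hash set and a dictionary stand in for the binary search tree and the "efficient data storage" mentioned in the text, and the query (a unit square) is invented for the example. The dataset is the six-point set D listed in section 2.1.2.

```python
from collections import defaultdict


def subtract(p, q):
    """Vector that maps point q onto point p."""
    return tuple(pi - qi for pi, qi in zip(p, q))


def complete_matches(query, dataset):
    """Alignment approach: align the first query point with every dataset
    point and keep the translations under which every query point lands
    on a dataset point."""
    points = set(dataset)  # assumption: hash set instead of a search tree
    found = []
    for p in dataset:
        v = subtract(p, query[0])
        if all(tuple(qi + vi for qi, vi in zip(q, v)) in points for q in query):
            found.append(v)
    return sorted(found)


def best_partial_matches(query, dataset):
    """Vector approach: compute each query-to-dataset vector once and group
    the query points by vector; the largest groups are the best matches."""
    groups = defaultdict(list)
    for q in query:
        for p in dataset:
            groups[subtract(p, q)].append(q)
    best = max(len(qs) for qs in groups.values())
    return {v: qs for v, qs in groups.items() if len(qs) == best}


# Dataset D from section 2.1.2; the query pattern is hypothetical.
dataset = [(1, 1), (1, 3), (2, 1), (2, 2), (2, 3), (3, 2)]
query = [(0, 0), (1, 0), (1, 1), (0, 1)]

print(complete_matches(query, dataset))
for vector, matched in sorted(best_partial_matches(query, dataset).items()):
    print(vector, matched)
```

For this data there is no complete match, while four different translations each carry three of the four query points onto dataset points. The grouping step visits each of the mn query-dataset pairs exactly once, which is where the linear dependence on the size of the query pattern comes from.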

2 What the algorithms do

In this section we provide a straightforward description of two of our algorithms, one for pattern-discovery (SIATEC) and one for pattern-matching (SIA(M)E). More detailed explanations and descriptions can be obtained by contacting the authors.

2.1 SIATEC

2.1.1 Introduction

SIATEC is an algorithm that discovers complete sets of translation-invariant patterns (which we call "translational equivalence classes" or TECs) in any multidimensional dataset. The algorithm takes a multidimensional dataset as input and generates as output a set of TECs. Two patterns in a dataset are said to be "translationally equivalent" if and only if one can be obtained from the other by the geometrical transformation of translation alone (i.e. no rotation, reflection or enlargement required). Each pattern (i.e. set of points) in a dataset is a member of exactly one TEC and the TEC to which a pattern belongs contains all and only those patterns to which the pattern is translationally equivalent. SIATEC generates for each of the largest repeated patterns in a dataset the TEC that contains that pattern. For a k-dimensional dataset of size n, SIATEC computes TECs with a worst-case time complexity of $O(n^3)$. If TECs are not needed the time complexity reduces to $O(n^2 \log_2 n)$.
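The definitions of translational equivalence and of a TEC can be made concrete with a short Python sketch. It illustrates the definitions only and is not SIATEC itself: the function names are hypothetical, patterns are assumed to be non-empty collections of distinct points, and the brute-force search makes no attempt to reach the complexity bounds quoted above. The dataset is the six-point set D listed in section 2.1.2.

```python
def subtract(p, q):
    """Vector that maps point q onto point p."""
    return tuple(pi - qi for pi, qi in zip(p, q))


def translate(pattern, v):
    """Pattern moved by the vector v."""
    return [tuple(qi + vi for qi, vi in zip(q, v)) for q in pattern]


def translationally_equivalent(p1, p2):
    """True if p2 can be obtained from p1 by translation alone.
    Sorting pairs up corresponding points, because translation
    preserves the lexicographic order of the points."""
    a, b = sorted(p1), sorted(p2)
    if len(a) != len(b):
        return False
    v = subtract(b[0], a[0])
    return all(subtract(y, x) == v for x, y in zip(a, b))


def translators(pattern, dataset):
    """All vectors v for which the pattern translated by v lies in the dataset."""
    points = set(dataset)
    anchor = pattern[0]
    found = []
    for p in dataset:
        v = subtract(p, anchor)
        if all(q in points for q in translate(pattern, v)):
            found.append(v)
    return sorted(found)


def tec(pattern, dataset):
    """The translational equivalence class of the pattern within the dataset."""
    return [sorted(translate(pattern, v)) for v in translators(pattern, dataset)]


# Dataset D from section 2.1.2; the two-point pattern is chosen for illustration.
dataset = [(1, 1), (1, 3), (2, 1), (2, 2), (2, 3), (3, 2)]
pattern = [(1, 1), (2, 1)]

print(translators(pattern, dataset))
print(tec(pattern, dataset))
```

Here the pattern occurs three times in the dataset, so its TEC contains three patterns, one per translator (including the zero vector, which maps the pattern onto itself).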

Figure 4: A simple two-dimensional dataset (points a to f plotted against x and y axes).

2.1.2 Some preliminary concepts

Let us assume that our multidimensional dataset, which we denote by D, consists of the six two-dimensional datapoints labelled a to f shown on the graph in Figure 4. Each of these datapoints can be represented as a pair of co-ordinates in the usual way so that, for example, the datapoint a can be represented by the ordered pair $\langle 1, 1 \rangle$. The complete dataset D can therefore be represented as the set of ordered pairs

$$D = \{\langle 1, 1 \rangle, \langle 1, 3 \rangle, \langle 2, 1 \rangle, \langle 2, 2 \rangle, \langle 2, 3 \rangle, \langle 3, 2 \rangle\}$$

Given a datapoint d, the notation d[i] denotes the ith co-ordinate value of d. For example, the datapoint b in Figure 4 has the value $\langle 1, 3 \rangle$, therefore b[1] = 1 and b[2] = 3. We denote by |d| the number of co-ordinate values in d. Thus for any datapoint d in a K-dimensional dataset, |d| = K.

Given any pair of datapoints, $d_1$, $d_2$, in a K-dimensional dataset, we define that $d_1$ is less than $d_2$, denoted $d_1 < d_2$, if and only if there exists a positive integer $j \le K$ such that $d_1[j] < d_2[j]$ and $d_1[i] = d_2[i]$ for all positive integer values of i less than j. Thus, in Figure 4, a