Temporal Mining Algorithms: Generalization and ... - Semantic Scholar

8 downloads 0 Views 454KB Size Report
or lifespan, this being the period between the first and the last time the item appears in transactions ... An example of such a pattern is that customers typically rent “Star Wars”, then. “Empire Strikes Back”, and then “Return of the Jedi”. Note that ...
Ben-Gurion University of the Negev Department of Computer Science

Temporal Mining Algorithms: Generalization and Performance Improvements

Thesis submitted as part of the requirements for the M.Sc. degree of Ben-Gurion University of the Negev by

Litvak Marina

The research work for this thesis has been carried out at Ben-Gurion University of the Negev under the supervision of Prof. Ehud Gudes

November 2004

2

Subject: Temporal Mining Algorithms: Generalization and Performance Improvements

This thesis is submitted as part of the requirements for the M.Sc. degree Written by: Marina Litvak Advisor: Prof. Ehud Gudes Department: Computer Science Faculty: Natural Sciences Ben-Gurion University of the Negev

Author signature:

Date:

Advisor signature:

Date:

Dept. Committee Chairman signature:

Date:

Contents

1 Introduction

6

1.1

Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

6

1.2

Mining algorithms

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

7

1.3

Contributions of the thesis . . . . . . . . . . . . . . . . . . . . . . . . . .

9

1.4

Structure of the thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . .

9

2 Scientific Background

11

2.1

Classification algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . .

11

2.2

Association rules as a basic data mining task . . . . . . . . . . . . . . . .

13

2.2.1

Apriori algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . .

15

2.2.2

Quantitative association rules . . . . . . . . . . . . . . . . . . . .

18

2.2.3

Clustering association rules . . . . . . . . . . . . . . . . . . . . .

26

Temporal mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

34

2.3.1

Sequential analysis sequential patterns . . . . . . . . . . . . . . .

38

2.3.2

Trend dependency mining[21] . . . . . . . . . . . . . . . . . . . .

50

2.3.3

Calendric association rules . . . . . . . . . . . . . . . . . . . . . .

53

2.3

3 Trend Dependency Mining

56

3.1

Our approach to solve TDMINE . . . . . . . . . . . . . . . . . . . . . . .

57

3.2

General trend dependencies discovery . . . . . . . . . . . . . . . . . . . .

59

3.2.1

59

Problem statement . . . . . . . . . . . . . . . . . . . . . . . . . . 2

CONTENTS 3.2.2 3.3

3.4

3 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

59

Multi-Relational Trend Dependencies discovery (MRTD) . . . . . . . . .

63

3.3.1

Definitions and problem statement . . . . . . . . . . . . . . . . .

63

3.3.2

The algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

66

3.3.3

Illustrative example of MRTD mining . . . . . . . . . . . . . . . .

71

Performance evaluation and experiments . . . . . . . . . . . . . . . . . .

76

3.4.1

The tested data . . . . . . . . . . . . . . . . . . . . . . . . . . . .

76

3.4.2

Scale-up properties . . . . . . . . . . . . . . . . . . . . . . . . . .

77

4 Temporal Continuous Sequential Patterns Discovery

80

4.1

Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

80

4.2

Definitions and problem statement . . . . . . . . . . . . . . . . . . . . .

82

4.3

The CTSPD algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . .

86

4.3.1

Candidate generation . . . . . . . . . . . . . . . . . . . . . . . . .

88

4.3.2

Pruning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

91

4.3.3

Support counting . . . . . . . . . . . . . . . . . . . . . . . . . . .

95

4.3.4

Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

95

4.3.5

The other approaches . . . . . . . . . . . . . . . . . . . . . . . . . 100

4.4

The CSPADE algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 4.4.1

4.5

Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108

Performance evaluation and experiments . . . . . . . . . . . . . . . . . . 110 4.5.1

The tested data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110

4.5.2

Scale-up properties . . . . . . . . . . . . . . . . . . . . . . . . . . 111

4.5.3

Relative performance . . . . . . . . . . . . . . . . . . . . . . . . . 112

5 Conclusions

113

CONTENTS

4

Abstract Temporal Mining Algorithms: Generalization and Performance Improvements

Data mining consists of finding interesting trends or patterns in large datasets, in order to guide decisions about future activities. There is a general expectation that data mining tools should be able to identify these patterns in the data with minimal user input. The patterns identified by such tools can give a data analyst useful and unexpected insights that can be more carefully investigated subsequently. The most commonly sought patterns are association rules, that identify a frequently occurring pattern of information in the database. In the first part of research we study the problem of mining clustered association rules. The clustered and the quantitative Association Rules are useful in the context of mining rules over quantitative attributes. Since data used in data mining algorithms is usually temporal, it is very important to discover correlations of attributes over several snapshots. Information like this may affect decisions made in different areas of the business world. We study such problems as: the problem of discovering trend dependencies in temporal data, and temporal sequences mining. The discovered dependencies can be useful for many applications, including: creating special packages of promotions and sales based on customers behavior prediction, creating compact statistical information, and more. In the second part of research we propose some new approaches for mining temporal rules, based on trend dependencies discovery. Several extensions of trend dependency mining algorithms are presented in this thesis, in particular the multi-relational trend dependency mining. Algorithms with proofs of correctness and completeness are given. We also change the definition of support for a trend dependency. The algorithm can be used for mining trend dependencies of different types with variable number of relations, thus it is more general than previous approaches.

CONTENTS

5

In the third part of research we introduce the problem of mining target events rules that are based on the discovery of continuous sequential patterns over temporal customeroriented datasets. Each transaction of such dataset consists of a set of events that are associated with a customer id and a timestamp. For each customer there are several transactions with different timestamps. One of the events is defined as the target event. We propose two algorithms, CTSPD and CSPADE, to discover continuous sequences, that lead to user-specified target. An experimental evaluation of the proposed algorithms is provided, and directions of future work are outlined.

Chapter 1 Introduction 1.1

Motivation

The amount of data kept in databases is growing at a phenomenal rate. At the same time, the users of this data are expecting more sophisticated information from it. A marketing manager is no longer satisfied with a simple listing of marketing contacts, but wants detailed information about customers’ past purchases as well as predictions of future purchases. Simple structured (query language) queries are not adequate to support these increased demands for information. Data mining helps to serve these needs. Data mining is often defined as finding hidden information in a database. Data mining access of a database differs from the traditional access in several ways, defined in [26]: 1. Query: The query might not be well formed or precisely stated. The data miner might not even be exactly sure of what he wants to see. 2. Data: The data accessed is usually a different version from that of the original operational database. The data have been integrated from different sources, cleansed and modified to better support the mining process. 6

Chapter 1. Introduction

7

3. Output: The output of the data mining query is probably not a subset of the database. Instead it is the output of some analysis of the contents of the database, extracting knowledge in the form of rules, patterns, classifications, etc. Data mining involves many different algorithms to accomplish different tasks. All of these algorithms can be characterized (as in [26] and [19]) as consisting of 3 parts: 1. Model: The purpose of the algorithm is to fit a model to the data. 2. Preference: Some criteria must be used to fit one model over another. 3. Search: All algorithms require some technique to search the data. The created model can be either predictive or descriptive in nature. A predictive model makes a prediction about values of data using known results found from some given data. This model data mining tasks includes classification, regression, time series analysis, and prediction. A typical example is a Bayes Knowledge where fixing the value of one node will predict the value of another node. A descriptive model identifies patterns or relationships in the data. Unlike the predictive model, it serves as a way to explore the properties of the data examined, not to predict new properties. Clustering, summarization, association rules, and sequence discovery are usually viewed as descriptive in nature. A typical example is an association rule, which identifies a relationship between two purchased itemsets.

1.2

Mining algorithms

Data Mining, as a process of inferring knowledge from a huge data, has three major tasks: Clustering or Classification, Association rules and Sequence Analysis ([16]). By a simple definition, in classification (or clustering) we analyze a set of data and generate a set of grouping rules which can be used to classify future data. For example,

Chapter 1. Introduction

8

one may classify diseases and provide the symptoms which describe each class or subclass. This has much in common with traditional work in statistics and machine learning. However, there are important new issues which arise because of the sheer size of the data. One of the important problem in data mining is the Classification-rule learning which involves finding rules that partition given data into predefined classes. In the data mining domain where millions of records and a large number of attributes are involved, the execution time of existing algorithms can become prohibitive, particularly in interactive applications. An Association Rule is a rule which implies certain association relationships among a set of objects in a dataset. In this process we discover a set of association rules at multiple levels of abstraction from the relevant set(s) of data in a database. For example, one may discover a set of symptoms often occurring together with certain kinds of diseases and further study the reasons behind them. Since finding interesting association rules in databases may disclose some useful patterns for decision support, selective marketing, financial forecast, medical diagnosis, and many other applications, it has attracted a lot of attention in recent data mining research. Mining association rules may require iterative scanning of large transaction or relational databases which is quite costly in processing. Therefore, efficient mining of association rules in transaction and/or relational databases has been studied substantially. This is discussed in detail in Section 2.2. In Sequential Analysis, we seek to discover patterns that occur in sequence. This deals with data that appear in separate transactions (as opposed to data that appear in the same transaction in the case of association). For example : “If a shopper buys item A in the first week of the month, then s/he buys item B in the second week, etc”. This is discussed in detail in Section 2.3. There are many proposed algorithms that try to address the above aspects of data mining. Compiling a list of all algorithms suggested/used for these problems is an arduous task. We have thus limited the focus of this thesis to list only some of the algorithms

Chapter 1. Introduction

9

that reference to our work. They are discussed in detail in Section 2.

1.3

Contributions of the thesis

The contributions of this thesis are in two areas. First in enhancing and generalizing some of the existing algorithms, and second in implementing the algorithms, integrating them within a general system, and conducting experimental evaluation of these algorithms. Specifically, in Trend Dependencies we developed and implemented the General Trend Dependencies Discovery. Also, we propose a Multi-Relational Trend Dependencies model to represent the numerical attribute evolutions. To prove the feasibility of our approach we implemented the proposed algorithms and tested them on some temporal customeroriented databases. In Temporal sequential association rules we introduce two new algorithms for continuous sequential patterns discovery, that given the specific customeroriented dataset find only continuous sequences of events that ended in one of the userspecified target events. All our algorithms were implemented in the Java programming language, integrated with our FlexMine system [10] and tested on real-life datasets.

1.4

Structure of the thesis

In this thesis we have investigated and experimented with several different techniques for mining association rules and frequent patterns. Each technique was enhanced with some of our own ideas and heuristics. All the techniques were implemented within the FlexMine system, and the experiments were done using this system and a set of real-life databases. In order to describe our results in a logical and comprehensive way, we chose to separate the experimental results according to the technique used. Thus the outline of the rest of the thesis is as follows: Chapter 2 presents the background, including some related work on association rules, temporal mining and sequential

Chapter 1. Introduction

10

patterns discovery. In Chapter 3 we discuss our general/enhanced Trend dependencies algorithm and the related experimental results. In Chapter 4 we discuss temporal sequential patterns and their results. It includes two sub-chapters, describing two proposed algorithms for discovering of continuous sequences: CTSPD and CSPADE, and their comparison.

Chapter 2 Scientific Background In this section we review several data mining approaches. We first discuss Classification (Chapter 2.1), then we discuss Association Rules and the Apriori algorithm (Section 2.2) and also we present Quantitative and Clustered Association Rules. Then we review Temporal Mining including the Sequential Patterns Discovery (Chapter 2.3).

2.1

Classification algorithms

In Data classification one develops a description or a model for each class in a database, based on the features present in a set of class-labeled training data. There are many data classification methods studied, including decision-tree methods, statistical methods, neural networks, rough sets, database-oriented methods etc. 1. Data Classification Methods [20]. Some of the machine-learning algorithms, like those introduced in [27], have been successfully applied in the initial stages of this field [22]. The authors of [27] introduce new algorithms and data structures for quick counting for machine learning datasets. They focus on the counting task of constructing contingency tables, but their approach is also applicable to counting the number of records in a dataset that match conjunctive queries. They provide a very sparse data structure, the ADtree, to minimize memory 11

Chapter 2. Scientific Background

12

use. Next we briefly discuss the main methods that are used for Data Mining. • Statistical Algorithms Statistical analysis systems such as SAS and SPSS, [8], have been used by analysts to detect unusual patterns and explain patterns using statistical models such as linear models. Such systems have their place and will continue to be used. • Neural Networks Artificial neural networks mimic the pattern-finding capacity of the human brain and hence some researchers have suggested applying Neural Network algorithms to pattern-mapping. Neural networks have been applied successfully in a few applications that involve classification, [28, 20]. • Genetic algorithms Optimization techniques that use genetic combination, mutation, and natural selection in a design based on the concepts of natural evolution, [9, 20]. • Nearest neighbor method A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset. Sometimes called the k-nearest neighbor technique, [29, 20]. • Rule induction The extraction of useful if-then rules from data based on statistical significance, [29, 20]. • Data visualization The visual interpretation of complex relationships in multidimensional data, [35, 20]. 2. Data Abstraction [31, 20, 24]. Many existing algorithms suggest abstracting the test data before classifying it into various classes. There are several alternatives for doing abstraction before classification: A data set can be generalized to either a minimally generalized abstraction level, an

13

Chapter 2. Scientific Background

intermediate abstraction level, or a rather high abstraction level. Too low an abstraction level may result in scattered classes, bushy classification trees, and difficulty at concise semantic interpretation; whereas too high a level may result in the loss of classification accuracy. The generalization-based multi-level classification process has been implemented in the DB-Miner system. 3. Classification-rule learning Classification-rule learning involves finding rules or decision trees that partition given data into predefined classes, [26, 19, 20].

For any realistic problem domain of the

classification-rule learning, the set of possible decision trees is too large to be searched exhaustively. In fact, the computational complexity of finding an optimal classification decision tree is NP hard.

2.2

Association rules as a basic data mining task

One of the most known data mining tasks is the mining of association rules. An association rule (AR) is a model that identifies specific types of data associations ([3], [4]). These associations are often used in the retail sales community to identify items that are frequently purchased together. Example 2.1 illustrates the use of ARs in market basket analysis ([11]). Here the data analyzed consists of information about what items a customer purchases. ARs are also used in many other applications such as predicting the failure of telecommunication switches. Definition 2.1. Association Rule (AR) is the rule that express associative relationship between two itemsets and denoted by X ⇒ Y (Bread ⇒ Jelly). Definition 2.2. A itemset is a non-empty set of items. Example 2.1. A grocery store retailer is trying to decide whether to put bread on sale. To help determine the impact of this decision, the retailer generates association rules

Chapter 2. Scientific Background

14

that show what other products are frequently purchased with bread. He finds that 60% of the time that bread is sold so are pretzels and that 70% of the time jelly is also sold, that may be denoted: Bread ⇒ P retzels (with confidence = 0.6) and Bread ⇒ Jelly (with confidence = 0.7). Based on these facts, he tries to capitalize on the association between bread, pretzels and jelly by placing some pretzels and jelly at the end of the aisle where the bread is placed. In addition, he decides not to place either of these items on sale at the same time. Users of ARs must be cautioned that these are not causal relationships. They do not represent any relationship inherent in the actual data or in the real world. There is no guarantee that founded associations will apply in the future. However, ARs can be used to assist retail store management in effective advertising, marketing and inventory control. Two important concepts in AR are support and confidence ([3], [4]). Definition 2.3. AR X ⇒ Y has support s %, if at least s % of transactions in the database (DB) contain X and Y . Definition 2.4. AR X ⇒ Y has confidence c %, if at least c % of transactions in DB that contain X also contain Y . Authors of [33] introduce the problem of mining association rules in large relational databases containing both quantitative and categorical attributes. They deal with quantitative attributes by partitioning the values of the attribute to intervals and then combining adjacent intervals until combined support exceeds the user-specified maximum support parameter. They introduce a partial completeness measure that with other parameters defines the number of intervals for initial partition.

15

Chapter 2. Scientific Background

In [23] the problem of clustering two-dimensional association rules is considered. The authors present a geometric-based algorithm, BitOp, for performing the clustering. The algorithm is the approximation to optimal solution with factor O(H(n)) (that is close to O(logn)), where n is the minimal number of clusters. The approach to the same problem introduced in [40] differs from the previous ones. The authors introduce a new definition of quantitative association rules based on statistical inference theory. This definition reflects the intuition that the goal of association rules is to find extraordinary and therefore interesting phenomena in databases.

2.2.1

Apriori algorithm

An association rule mining algorithm, Apriori [19] has been first developed for rule mining in large transaction databases by IBM’s Quest project team. They have decomposed the problem of mining association rules into two parts : 1. Find all combinations of items that have transaction support above minimum support (min sup). Call those combinations frequent itemsets. 2. Use the frequent itemsets to generate the desired rules. The general idea is that if, say, ABCD and AB are frequent itemsets, then we can determine if the rule AB ⇒ CD holds by computing the ratio r =

support(ABCD) . support(AB)

The rule holds only if

r ≥ minimum confidence. Note that the rule has minimum support because ABCD is frequent. The Apriori algorithm used in Quest for finding all frequent itemsets is given below. The main target of Apriori is handling the exponential complexity of AR discovery algorithm: given n transactions and m different items the number of possible association rules is O(m2m−1 ) and computation complexity in case of scanning the dataset for each possible rule is O(nm2m ). The Apriori principle is based on support constraint that consists of two components: • If AB has support at least a, then both A and B have support at least a.

Chapter 2. Scientific Background

16

• Use patterns of n − 1 items to find patterns of n items. Guiding Principle: Every subset of a frequent itemset has to be frequent - used for pruning many candidates, that have any not frequent subset.

Apriori Principle In Figure 2.1 one can see how the number of dataset scans is reduced using the support-based pruning. If all subsets are considered than we need to scan the transactions 41 times (all subsets of size 1(6)+ all pairs (15)+ all triplets (20)) and using the pruning this number is reduced to 14. Algorithm 1 Apriori AprioriAlg() Lk−1 := all frequent 1-itemsets; for (k := 2; Lk−1 6= ∅; k + +) do { Ck =apriori-gen(Lk−1 ); // new candidates for all transactions t ∈ D do { for all candidates c ∈ Ck contained in t do c : count + +; } Lk = c ∈ Ck |c : count ≥ min supp } return Lk ;

As shown in Algorithm 1, Apriori makes multiple passes over the database D. In the first pass, the algorithm simply counts item occurrences to determine the frequent 1-itemsets (itemsets with 1 item). A subsequent pass, say pass k, consists of two phases. First, the frequent itemsets Lk−1 (the set of all frequent k − 1-itemsets) found in the k − 1th pass are used to generate the candidate itemsets Ck , using the apriori-gen() function. This function, as can be seen in Algorithm 2, first joins Lk−1 with Lk−1 , the

17

Chapter 2. Scientific Background

Minimum Support = 3 Item

Count

Bread

4

Coke

2

Milk

4

Beer

3

Diaper

4

Eggs

1

Items (1- itemsets)

⇓ Itemset

Count

{Bread, Milk}

3

{Bread, Beer}

2

{Bread, Diaper}

3

{Milk, Beer}

2

{Milk, Diaper}

3

{Milk, Diaper}

3

Pairs (2-itemsets)

⇓ Itemset

Count

{Bread, Milk, diaper}

3

{Milk, Diaper, Beer}

2

Triplets (3-itemsets)

⇓ ···

Figure 2.1: Apriori principle - illustrative example

Chapter 2. Scientific Background

18

joining condition being that the lexicographically ordered first k − 2 items are the same. Next, it deletes all those itemsets from the join result that have some k − 1-subset that is not in Lk−1 , yielding Ck . Algorithm 2 Candidate itemsets generation Apriori-gen(Lk−1 ) result = {} foreach is1 ∈ Lk−1 foreach is2 ∈ Lk−1 , is2 > is1 if (is1 .item1 == is2 .item1 and . . . and is1 .itemk−2 == is2 .itemk−2 ){ is = is1 .item1 , . . . , is1 .itemk−2 , is1 .itemk−1 , is2 .itemk−1 ; if (∀s(s ⊂ is ⇒ s ∈ is)) result = result ∪ is; return result

The algorithm now scans the database. For each transaction, it determines which of the candidates in Ck are contained in the transaction using a hash-tree data structure and increments the count of those candidates. At the end of the pass, Ck is examined to determine which of the candidates are frequent, yielding Lk . The algorithm terminates when Lk becomes empty.

2.2.2

Quantitative association rules

Association rules discover patterns and correlations that may be buried deep inside a database. Therefore they have become a key data-mining tool and as such have been well researched. Relational tables in most business and scientific domains have rich attribute types. Attributes can be quantitative (e.g. age, income) or categorical (e.g. zip code, make of car). Boolean attributes can be considered a special case of categorical

Chapter 2. Scientific Background

19

attributes. So, current solutions for this case are so far inadequate. The authors of [33] define the problem of mining association rules over quantitative and categorical attributes in large relational tables as the Quantitative Association Rules problem. An example of such an association might be “10% of married people between age 50 and 60 have at least 2 cars”. The authors deal with quantitative attributes by partitioning the values of the attribute into intervals and then combining adjacent partitions as necessary. They introduce measures of partial completeness which quantify the information lost due to partitioning. A direct application of this technique can generate too many similar rules. They tackle this problem by using a greater-thanexpected-value interest measure to identify the interesting rules in the output and give an algorithm for mining such quantitative association rules.

Definitions Here we want to introduce formal statement of the problem and relevant terminology. The details one can see in [33]. Let I = {i1 , i2 , . . . , im } be a set of literals, called attributes. Let P denote the set of positive integers. Let IV denote the set I × P . A pair hx, vi ∈ IV denotes the attribute x, with the associated value v. Let IR denote the set {hx, l, ui} ∈ I × P × P }|l ≤ u, if x is quantitative; l = u, if x is categorical}. Thus, a triple hx, l, ui ∈ IR denotes either a quantitative attribute x with a value in the interval [l, u], or a categorical attribute x with a value l. We will refer to this triple as an item. For any X ∈ IR , let attributes(X) denote the set {x|hx, l, ui ∈ X}. Note that with the above definition, only values are associated with categorical attributes, while both values and ranges may be associated with quantitative attributes. In other words, values of categorical attributes are not combined. Let D be a set of records, where each record R is a set of attribute values such that R ⊆ IV . We assume that each attribute occurs at most once in a record. We say that a record R supports X ⊆ IR , if ∀hx, l, ui ∈ X ∃hx, qi ∈ R such that l ≤ q ≤ u.

Chapter 2. Scientific Background

20

A quantitative association rule is an implication of the form X ⇒ Y , where X ⊂ IR , Y ⊂ IR , and attributes(X) ∩ attributes(Y ) = ∅. The rule X ⇒ Y holds in the record set D with confidence c if c% of records in D that support X also support Y . The rule X ⇒ Y has support s in the record set D if s% of records in D support X ∪ Y . Given a set of records D, the problem of mining quantitative association rules is to find all quantitative association rules that have support and confidence greater than the user-specified minimum support (called min sup) and minimum confidence (called min conf ) respectively. Note that the fact that items in a rule can be categorical or quantitative has been hidden in the definition of an association rule. We call X 0 a generalization of X (X a specialization of X 0 ) if attributes(X 0 ) = attributes(X) and ∀x ∈ attributes(X) [hx, l, ui ∈ X ∧ hx, b l, u bi ∈ X 0 ⇒ b l≤l≤u≤u b]. For example, the itemset {hAge : 30 . . . 39i, hM arried : Y esi} is a generalization of {hAge : 30 . . . 35i, hM arried : Y esi}. Partitioning Quantitative Attributes. There are two main difficulties of partitioning : to decide whether to partition a quantitative attribute or not and to compute how many partitions should be there. There are two basic problems associated with partitioning: • MinSup: If number of intervals is large then the support for any single interval can be low and therefore some rules may not be found. • MinConf : Large interval size usually implies low confidence for rules. Now we consider when we should partition the values of quantitative attributes into intervals, and how many partitions there should be. First, we present a measure of partial completeness which gives a handle on the amount of information lost by partitioning. The authors of [33] use the equi-depth partitioning — the method generates a number of intervals (bins) that contain roughly the same number of tuples. Then they show that equi-depth partitioning minimizes the number of intervals required to satisfy this

Chapter 2. Scientific Background

21

partial completeness level. Thus equi-depth partitioning is, in some sense, optimal for this measure of partial completeness. The intuition behind the partial completeness measure is as follows. Let R be the set of rules obtained by considering all ranges over the raw values of quantitative attributes. b be the set of rules obtained by considering all ranges over the partitions of quantiLet R b is to tative attributes. One way to measure the information loss when we go from R to R b is. The further away the closest see for each rule in R, how “far” the “closest” rule in R rule, the greater the loss. By defining “close” rules to be generalizations, and using the ratio of the support of the rules as a measure of how far apart the rules are, the authors of [33] derive the measure of partial completeness given below. Partial Completeness. R.Srikant and R.Agrawal, [33], first define partial completeness over itemsets rather than rules, since we can guarantee that a close itemset will be found whereas we cannot guarantee that a close rule will be found. They then show that we can guarantee that a close rule will be found if the minimum confidence level b is less than that for R by a certain (computable) amount. Let C denote the set of for R all frequent itemsets in D. For any K ≥ 1, a subset P of C is called K-complete with respect to C if: • P ⊆ C, b ∈X⇒X b ∈ P , and • X ∈ P and X • ∀ X ∈ C ∃ X 0 ∈ P such that i. X 0 is a generalization of X and support(X 0 ) ≤ K × support(X), ii. ∀ Y ⊆ X ∃ Y 0 ⊆ X 0 such that Y 0 is a generalization of Y and support(Y 0 ) ≤ K × support(Y ). The first two conditions ensure that P only contains frequent itemsets and that we can generate rules from P . The first part of the third condition says that for any itemset

22

Chapter 2. Scientific Background

in C, there is a generalization of that itemset with at most K times the support in P . The second part says that the property that the generalization has at most K times the support also holds for corresponding subsets of attributes in the itemset and its generalization. Notice that if K = 1, P becomes identical to C. For example, assume that in some table, the following are the frequent itemsets C: Number

Itemset

Support

1

{hAge : 20 . . . 30i}

5%

2

{hAge : 20 . . . 40i}

6%

3

{hAge : 20 . . . 50i}

8%

4

{hCars : 1 . . . 2i}

5%

5

{hCars : 1 . . . 3i}

6%

6

{hAge : 20 . . . 30i, hCars : 1 . . . 2i}

4%

7

{hAge : 20 . . . 40i, hCars : 1 . . . 3i}

5%

The itemsets 2, 3, 5 and 7 would form a 1.5-complete set, since for any itemset X, either 2, 3, 5 or 7 is a generalization whose support is at most 1.5 times the support of X. For instance, itemset 2 is a generalization of itemset 1, and the support of itemset 2 is 1.2 times the support of itemset 1. Itemsets 3, 5 and 7 do not form a 1.5-complete set because for itemset 1, the only generalization among 3, 5 and 7 is itemset 3, and the support of 3 is more than 1.5 times the support of 1. Lemma 2.1. Let P be a K-complete set w.r.t. C, the set of all frequent itemsets. Let RC be the set of rules generated from C, for a minimum confidence level minconf. Let RP be the set of rules generated from P with the minimum confidence set to minconf/K. b⇒B b ∈ RP such that Then for any rule A ⇒ B ∈ RC , there is a rule A b is a generalization of A, B b is a generalization of B, • A b⇒B b is at most K times the support of A ⇒ B, and • the support of A

Chapter 2. Scientific Background

23

b⇒B b is at least 1/K times, and at most K times the confidence • the confidence of A of A ⇒ B. Thus, given a set of frequent itemsets P which is K-complete w.r.t. the set of all frequent itemsets, the minimum confidence when generating rules from P must be set to 1/K times the desired level to guarantee that a close rule will be generated. In the example given earlier (see Table 2.2.2), itemsets 2, 3 and 5 form a 1.5-complete set. The rule hAge : 20 . . . 30i ⇒ hCars : 1 . . . 2i has 80% confidence, while the corresponding generalized rule hAge : 20 . . . 40i ⇒ hCars : 1 . . . 3i has 83.3% confidence. Determining the number of partitions. The authors use some proved properties of partial completeness to decide the number of intervals. Number of Intervals =

2×n m×(K−1)

where:

• n = Number of Quantitative Attributes • m = Minimum Support (as a fraction) • K = Partial Completeness Level If there are no rules with more than n e quantitative attributes, we can replace n with n e in the above formula. Algorithm In this paragraph we describe the basic phases of the Quantitative AR mining algorithm. Algorithm consists of the next steps below: Steps of QAR generation:[33] 1. Determine the number of partitions for each quantitative attribute (using equi-depth partition). 2. Mapping the partitioned data to discrete values (discretization). For categorical attributes, map the values of the attribute to a set of consecutive integers. For quantitative attributes that are not partitioned into intervals, the values are mapped to

Chapter 2. Scientific Background

24

consecutive integers such that the order of the values is preserved. If a quantitative attribute is partitioned into intervals, the intervals are mapped to consecutive integers such that their order is preserved.

3. Merging the adjacent intervals. Find the support for each value of both quantitative and categorical attributes. Additionally, for quantitative attributes, adjacent values are combined as long as their support is less than the user-specified max support. These values form the set of all frequent items. Next, find all sets of items whose support is greater than the user-specified minimum support. These are the frequent itemsets.

4. Rules generation. Use the frequent itemsets to generate association rules. The general idea is that if, say, ABCD and AB are frequent itemsets, then we can determine if the rule AB ⇒ CD holds by computing the ratio conf =

support(ABCD) . support(AB)

If conf ≥ min conf , then the rule holds. (The rule will have at least minimum support because ABCD is frequent.)

Example Consider the “People” table shown in Figure 2.2 (a). There are two quantitative attributes, Age and NumCars. Assume that in Step 1, we decided to partition Age into 4 intervals, as shown in Figure 2.2 (b). Conceptually, the table now looks as shown in Figure 2.2 (c). After mapping the intervals to consecutive integers, using the mapping in Figure 2.2 (d, e), the table looks as shown in Figure 2.2 (f). Assuming minimum support of 40% and minimum confidence of 50%, Figure 2.2 (g) shows some of the frequent itemsets, and Figure 2.3 some of the rules. We have replaced mapping numbers with the values in the original table in these two gures. Notice that the item hAge : [20, 29]i corresponds to a combination of the intervals [20, 24] and [25, 29], etc. We have not shown the step of determining the interesting rules in this example.

25

Chapter 2. Scientific Background People

Partitions for Age

RecId

Age

Married

NumCars

100

23

no

0

200

25

yes

1

300

29

no

1

400

34

yes

2

500

38

yes (a)

2

Interval [20, 24] [25, 29] [30, 34] [35, 39] (b)

After partitioning Age RecId

Age

Married

NumCars

100

[20, 24]

no

0

200

[25, 29]

yes

1

300

[25, 29]

no

1

400

[30, 34]

yes

2

500

[35, 39] (c)

yes

2

Mapping Age

Mapping Married

Interval

Integer

[20, 24]

1

Value

Integer

[25, 29]

2

yes

1

[30, 34]

3

no

2

[35, 39]

4 (d)

After mapping attributes RecId

Age

Married

NumCars

100

1

2

0

200

2

1

1

300

2

2

1

400

3

1

2

500

4

1

2

(e) FrequentItemsets: Sample Itemset

Support

{hAge : [20, 29]i}

3

{hAge : [30, 39]i}

2

{hM arried : yesi}

3

{hM arried : noi}

2

{hN umCars : [0, 1]i}

3

{hAge : [30, 39]i, hM arried : Y esi}

2

26

Chapter 2. Scientific Background Rules: Sample Rule

Support

Confidence

hAge : [30, 39] and hM arried : yesi ⇒ hN umCars : 2i

40%

100%

hAge : [20, 29]i ⇒ hN umCars : [0, 1]i

60%

66.6%

Figure 2.3: Discovered Rules Other approaches Authors of [40] introduce a new definition of quantitative association rules based on statistical inference theory. Their definition reflects the intuition that the goal of association rules is to find extraordinary and therefore interesting phenomena in databases.

2.2.3

Clustering association rules

Algorithm introduced in [23] performs association rule clustering in the two-dimensional space, where each axis represents one attribute from the database used on the left-hand size (LHS) of a rule. Generated rules are of the form A∧B ⇒ C where the LHS attributes (A and B) are quantitative and the RHS attribute (C) is categorical. Clustering approach is heuristic, based on the geometric properties of a two-dimensional grid, and produces an efficient linear time approximation to an optimal solution.

Definitions An attribute can be either categorical (for example, “zip code”, “hair color”, “make of car”) or non-categorical (“salary”, “age”, “interest rate”). Categorical attributes are those that have a finite number of possible values with no ordering amongst themselves. Non-categorical (or quantitative) attributes, do have an implicit ordering and can assume continuous values usually within a specified range. Let D be a database of tuples where each tuple is a set of attribute values, called items, of the form (attributei = valuei ). Because quantitative attributes will typically assume a wide range of values from their respective domains, the authors partition these attributes into intervals, called bins. In [23] only equi-width bins were considered. Equi-width method generates intervals of the

Chapter 2. Scientific Background

27

Figure 2.4: Sample grid with clustered association rules. same size. Other choices are possible, such as equi-depth bins (where each bin contains roughly the same number of tuples), or homogeneity-based bins (each bin is sized so that the tuples in the bin are uniformly distributed). Clustering, as defined here, is the combination of adjacent attributes values, or adjacent bins of attribute values. For example, clustering (Age = 40) and (Age = 41) results in (40 ≤ Age < 42). A clustered association rule is an expression of the form XC ⇒ YC . XC and YC are items of the form (Attribute = value) or (bini ≤ Attribute < bini+1 ), where bini denotes the lower bound for values in the ith bin. In [23] the authors consider the problem of clustering association rules of the form A ∧ B ⇒ C where the LHS attributes (A and B) are quantitative and the RHS attribute (C) is categorical. The RHS attribute could be quantitative, but would first require binning with the resulting bins then treated as categorical values. The authors define a segmentation as the collection of all the clustered association rules for a specific value C of the criterion attribute. Given a set of two-attribute association rules over binned data, they form a two-dimensional grid where each axis corresponds to one of the LHS attributes. On this grid we will plot, for a specific value of the RHS attribute, all of the corresponding association rules. An example of such a grid is shown in Figure 2.4. Our goal is to find the fewest number of clusters, shown as circles in the figure, that cover the association rules within this grid. These clusters represent our clustered association rules and define the segmentation.

Algorithm Figure 2.5 shows a high-level view of the entire system to compute the clustered association rules, which we implemented. The system is simpler than one introduced in [23]. While the source data is read, the attribute values are partitioned (by the binner) as described earlier. The association rule engine is a special-purpose

Chapter 2. Scientific Background

28

Record Data ↓ ↓ #of x/y-bins→

Binning the data ↓ array of binned data ↓

minsup,minconf→ Association Rules Discovery ↓ association rules ↓ Clustering ↓ ↓ Clustered Association Rules

Figure 2.5: Schema of Association Rule Clustering System algorithm that operates on the binned data. The minimum support is used along with the minimum confidence to generate the association rules. Once the association rules are discovered for a particular level of support and confidence, we then form a grid of only those rules that give us information about the group (RHS) we are segmenting. Then the BitOp algorithm is applied to this grid to form clusters of adjacent association rules in the grid. We now detail components of the schema shown in Figure 2.5. Binning Data. The binner reads in tuples from the database and replaces the tuples’ attribute values with their corresponding bin number. We first determine the bin numbers for each of the two (LHS) attributes, Ax and Ay . Using the corresponding bin numbers, binx and biny , we index into a 2D array where, for each binx , biny pair, we maintain the number of binx , biny tuples having each possible RHS attribute value, as well as the total number of binx , biny tuples. The size of the 2D array is nx ×ny ×(nseg +1)

Chapter 2. Scientific Background

29

where nx is the number of x-bins, ny is the number of y-bins, and nseg is the cardinality of the (RHS) segmentation attribute. The binning algorithm (Algorithm 3) receives the set D of transactions of the form (Ax , Ay , Aseg ), where Aseg is the segmentation criterion attribute appears on the RHS of the AR, the number of x-bins (nx ) and the number of y-bins (ny ) as input and outputs the structure BinArray[x, y, seg], containing aggregated tuple counts for the binned data. Algorithm 3 The Bining the data Binner(D, nx , ny ) hx = dom(Ax )/nx ; // width of the x-bins hy = dom(Ay )/ny ; // width of the y-bins while (D 6= ∅){ get next (ax , ay , aseg ) ∈ D; binx = ax /hx ; biny = ay /hy ; BinArray[binx , biny , binseg ] = BinArray[binx , biny binseg ] + 1; BinArray[binx , biny , T OT AL] = BinArray[binx , biny , T OT AL] + 1; }

Association rule engine. The authors of [23] describe an efficient algorithm for the special case of mining two-dimensional association rules using the data structure constructed by the binning process. Deriving association rules from the BinArray is straightforward. Let Gk be our RHS criterion attribute. Every cell in the BinArray can be represented by an association rule whose LHS values are the two bins that define the BinArray cell, and whose RHS value is Gk : (X = i) ∧ (Y = j) ⇒ Gk where (X = i) represents the range (binix ≤ X < bini+1 x ), and (Y = j) represents the i j th range (binjy ≤ Y < binj+1 x-attribute y ), and binx and biny are the lower bounds of the i

30

Chapter 2. Scientific Background bin and the j th y-attribute bin, respectively. The support for this rule is the confidence is

|(i, j, Gk )| |(i ,j)|

|(i, j, Gk )| N

and

where N is the total number of tuples in the source data,

|(i, j)| is the total number of tuples mapped into the BinArray at location (i, j), and |(i, j, Gk )| is the number of tuples mapped into the BinArray at location (i, j) with criterion attribute value Gk . To derive all the association rules for a given support and confidence threshold we need only check each of the occupied cells in the BinArray to see if the above conditions hold. If the thresholds are met, we output the pair (i, j) corresponding to the association rule on binned data as shown above. The algorithm is shown in Algorithm 4. It receives as input the BinArray computed from the binning component, the value Gk we are using as the criterion for segmentation, the min sup threshold (%), the min conf threshold (%), N - the total number of tuples in the source data, nx - the number of x-bins and ny - the number of y-bins. The output of this procedure is a set of pairs of bin numbers, (i, j), representing association rules of the form (X = i) ∧ (Y = j) ⇒ Gk . Algorithm 4 The Association Rule Generation GenAssocationRules() min sup count = N × min sup; /* Association rule generation from the binned data */ for(i = 1; i < nx ; i + +) for(j = 1; j < ny ; j + +) if ((BinArray[i, j, Gk ] ≥ min sup count) and (BinArray[i, j, Gk]/BinArray[i, j, T otal] > min conf )) Output (i, j)

Clustering. We begin by presenting a very simple example of the clustering problem to illustrate the idea. Consider the following four association rules where the RHS attribute “Group label” has value “A”:

31

Chapter 2. Scientific Background s7

70 − −80

s6

50 − −60

N

N

s5

40 − −50

N

N

s4

30 − −40

SALARY s3

20 − −30

s2

10 − −20

s1

below 10 38

39

40

41

42

43

a1

a2

a3

a4

a5

a6

AGE Figure 2.6: Grid representing the four association rules (Age = 40) ∧ (Salary = $42, 350) ⇒ (Group label = A) (Age = 41) ∧ (Salary = $57, 000) ⇒ (Group label = A) (Age = 41) ∧ (Salary = $48, 750) ⇒ (Group label = A) (Age = 40) ∧ (Salary = $52, 600) ⇒ (Group label = A) If the LHS Age bins are a1 , a2 , . . . , and the LHS Salary bins are s1 , s2 , . . . , (see Figure 2.6) then these rules are binned to form the corresponding binned association rules: (Age = a3 ) ∧ (Salary = s5 ) ⇒ (Group label = A) (Age = a4 ) ∧ (Salary = s6 ) ⇒ (Group label = A) (Age = a4 ) ∧ (Salary = s5 ) ⇒ (Group label = A) (Age = a3 ) ∧ (Salary = s6 ) ⇒ (Group label = A) We represent these four rules with the grid in Figure 2.6. All four of the original association rules are clustered together and subsumed by one rule: (a3 ≤ Age < a4 ) ∧ (s5 ≤ Salary < s6 ) ⇒ (Group label = A). Assuming the bin mappings shown in Figure 2.6, the final clustered rule output to the user is: (40 ≤ Age < 42) ∧ ($40, 000 ≤ Salary
) This approach is based on comparing tuples from both of timestamped tables and generating arrays of relations (>, , ). anArray ← ∅; //***PATTERN PHASE*** FOR s ∈ I1 FOR t ∈ I2 FOR A ∈ X ∪ Y IF (s(A) < t(A))

THEN q(A) := “ 00 ; anArray = anArray ∪ q; sort anArray by X, Y ; //***SORTING PHASE*** maxc ← 0; maxT D ← a dummy pattern; //***STATISTICS PHASE*** l1 ← 1; l2 ← 1; r1 ← 1; r2 ← 1; WHILE (l1 < i) x := anArray(l1 )[X]; WHILE (anArray(l2 )[X] == x) l2 := l2 + 1; l := l2 − l1 ; WHILE (r1 < l2 ) y := anArray(r2 )[Y ]; WHILE (anArray(r2 )[Y ] == y and r2 < l2 ) r2 := r2 + 1; r := r2 − r1 ; s := r/i; c := r/l; IF (s ≥ thr and c > maxc) THEN maxc := c; maxT D := anArray(r1 )[X ∪ Y ]; r1 := r2 ; l1 := l2 ;

52

Chapter 2. Scientific Background

53

above observations. The expression (SS#, =)(Rank, ) ⇒ (C, =, =), (A, )(B, , >, >)(B, >, =, 0 or 0 , >) is extreme, but (A, >, 0 /0 )2, 3],

[(A, )2, 3], [(A,

)2, 3], [(B, >)(C, >)2, 3], [(A, )(C, >)2, 3]} F P2

=

{[(A, )2, 3], [(A, )2, 3], [(A, )2, 3]} Figure 3.6: Frequent patternsets sets F P1 and F P2 , respectively All = {[(A, , >)2, 3], [(A, )2, 3](!), [(B, >, =)(C, >, >)2, 3], [(A, , >)2, 3]} Figure 3.7: Final set of Frequent Patternsets. According to the first approach, we discover the frequent patternsets for each differencetable. These sets (F P1 and F P2 , accordingly) are shown in Figure 3.6. The next thing we should do is to extend the sets and get the final set of frequent patternsets All (see Figure 3.7), using DimApriori algorithm (Algorithm 10). By ‘(!)’ we mark the sets which were discovered earlier via trend-matrix. In order to decide about candidates’ frequency, we intersect their id lists. According to the second approach, we do not calculate all frequent patternsets in each table, but only 1-patternsets (see chapter 3.3.2), and then run DimApriori on them. The appropriate FSPs (Frequent Single Patterns) look as shown in Figure 3.8. F SP1 and F SP2 are sets of patterns for DT1 and DT2 , respectively, and F SP is the result of the DimApriori. Now, we should perform one additional step — calculate the set of all the frequent patternsets (All) from F SP . It is presented in the same figure. Next and final phase — generating rules (TD) from All. There are several rules which we received for our dataset (see Figure 3.9). Note. We got the same properties on each TD, due to the equal support for each patternset. Of course, it is very unlikely that it would happen in real-world datasets.

Chapter 3. Trend Dependency Mining

75

F SP1 = {[(A, )1, 2, 3], [(C, >)2, 3]} F SP2 = {[(A, )2, 3]} F SP = {[(A, , >)2, 3]} All = {[(A, , >)2, 3], [(A, )2, 3](!), [(B, >, =)(C, >, >)2, 3], [(A, , >)2, 3]} Figure 3.8: Frequent Single Patterns and Final Frequent Patternsets, respectively.

(A, ) with support 2/3 and confidence 1, (A, ) ⇒ (B, >, =) with support 2/3 and confidence 1, (B, >, =) ⇒ (A, ) (the same), (B, >, =)(A, ), (B, block; } return matchCount;

4.4

The CSPADE algorithm

In this chapter we describe the basic phases of slightly modified SPADE algorithm that we call CSPADE (Continuous Sequential Patterns Descovery using Equivalence Classes). We made three main modifications to this algorithm: i. Using our definition of support, 4.14, ii. We added an additional criterion for pruning during the processing of a class, iii. Filtering the non-continuous patterns after discovering all frequent sequences. Recall that SPADE ([41]) works independently on equivalent classes which are specified by a common prefix. The main disadvantage of this algorithm is that partitioning into Equivalent Classes does not allow us to use our technique for generating only continuous patterns inside each class (the ConcatJoin and, in some cases, the ExpJoin do not work,

Chapter 4. Temporal Continuous Sequential Patterns Discovery

104

because the seed sequences belong to different classes). Also, we cannot prune all nonappropriate patterns during the class processing because they can produce appropriate candidates in the future (non-continuous patterns can form continuous). However, we can use the independency of classes and the fact, that each of them give patterns with the same prefix to limit the number of generated candidates. Definition 4.17. We say, that the sequential pattern l1 → l2 → · · · → lk , where ∃ i : li and li+1 are both either target or basic levels, or li contains target event as well as basic ones, have a non-continuous (or violated) structure. Else, the sequence have a valid structure and is called a valid sequence. Recall that sequential patterns in our domain cannot have a violated structure, defined above. We avoid creation of classes specified by prefix that has violated structure. This affects the way of creating new candidates, namely Temporal Join. We describe this later. Note that the structure based method of pruning still does not guarantee that we’ll get only frequent continuous patterns. So, after processing of all classes we filter all patterns which are inappropriate to be target rules. We describe all the new phases below. Since the first, third and last phases, called Sorting the database, Transformaton phase and Rules generation respectively, are equivalent to the respective phases in chapter 4.3, we omit these phases in the description. The phases describing partition into the Equivalent Classes and their processing do not differ from the respective phases of the original algorithm [41]. We briefly summarize them in the chapter 2.3.1. The other phases are:

2. The computation of the frequent 1-sequences We use a vertical database format (see Figure 4.12), where we maintain an id-list for each item. Each entry of the id-list is a hcid, tidi pair where the item occurs (cid is the customer id and tid is the time id). Note, that for the basic event we associate the start time and for the target the end time. Using our definition of support, 4.14, we do not calculate the fraction of

Chapter 4. Temporal Continuous Sequential Patterns Discovery

105

customers supporting the pattern, but the number of occurrences of this pattern in the database. Given the vertical id-list database, all frequent 1-sequences can be computed in a single database scan. For each database item, we scan its id-list, incrementing the support for each new entry encountered.

4. The computation of the frequent pairs We use horizontal format of the database as described in the chapter 2.3.1 and [41]. We create all possible pairs, except for the Event Atoms (see chapter 2.3.1) where at least one of the events is target. The reason is that such patterns will give non-appropriate candidates in the future due to their illegal structure and Temporal Join properties. We give more explanations in the paragraph 6.

5. The decomposition into prefix-based parent equivalence classes See chapter 2.3.1 and [41].

6. Processing the classes In addition to what is described in the chapter 2.3.1 and [41], we do not create new equivalent classes with a prefix which is a sequence with non-continuous structure, during the recursive application of θk . We can prove that such class will give us only non-continuous patterns that will not affect the whole process of sequence generation due to the independence of classes. In order to avoid the creation of such classes we need to exclude the patterns forming them, namely, the patterns with violated prefix (denote them violated patterns). It affects the frequent sequence enumeration inside a class, namely the Temporal Join (see chapter 2.3.1). We add several checkings to the Join 3. The candidates of joining P → A and P → B will be: P → AB if it is valid (see Definition 4.17), P → A → B if P → A is valid and P → B → A if P → B is valid. Now we want to prove, that these modifications in addition to constraints introduced in Paragraph 4.4 guarantee, that we’ll never receive violated sequences during processing the classes.

Chapter 4. Temporal Continuous Sequential Patterns Discovery

106

Lemma 4.4. Each sequence created during processing the classes is eiher • an Atom Event (see [41]) of the valid structure (by Definition 4.17) or • a Sequence Event ([41]) with the valid prefix. Proof. We’ll prove that by induction: 1. The claim is true for all sequences of size 2 (Recall, that we do not create Atom Events of violate structure at phase of pairs generation, 4.4). 2. Suppose, that it’s true for all sequences of size equals to k > 2. 3. Consider the sequence s of size n = k + 1. The sequence s was created by one of three ways: i. Join 1 (s is an Atom Event). New sequence was created from two sequences of size k: s0 and s00 , that both are valid by our assumption. If we’ll denote s0 by P X and s00 by P Y , where P is a prefix, then s will look like P XY . The prefix P is valid and ended by basic level, X and Y should be basic (by definition of valid structure, 4.17), so P XY is valid. ii. Join 2 (s is a Sequence Event). We get s = P X → Y from P X and P → Y , where P X is valid. iii. Join 3. s was created from two sequences of size k: P → X and P → Y , and may be one of three possibilities: P → XY if it is valid, P → X → Y if P → X is valid and P → Y → X if P → Y is valid.

We also use support-based pruning (see chapter 2.3.1) during the processing of the classes and the Temporal Join, but in the calculation of the support we do not check the continuity of a pattern, and therefore we may obtain a value that is equal to or larger than the actual


support value introduced in Definition 4.14 (we call the latter the continuous support). This claim is easy to prove by contradiction.

Proof. Assume there exists a sequence s whose support value is smaller than its continuous support value. In other words, the number of occurrences of s in the dataset when arbitrary events are allowed between its levels (see Definition 4.4) is smaller than the number of its occurrences in the same dataset when no events are allowed in between. Since every continuous occurrence is also an ordinary occurrence, this is a contradiction.

Algorithm 22 Filtering of non-appropriate patterns
  L2 ← {p : p ∈ L2, p is frequent and continuous};
  L ← L2; k ← 3;
  while (Lk ≠ ∅) {
    for all p ∈ Lk {
      for all subs ⊂ p with |subs| = |p| − 1 {
        if (subs ∉ L) { remove p from Lk; break; }
      }
    }
    SupportCount(Lk);
    L ← Lk; k ← k + 1;
  }
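The same filtering loop can be sketched in Java as follows (illustrative only; the Oracle interface stands in for the subsequence generation and the Match-based support counting described in the text):

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    class SequenceFiltering {

        // Placeholders for the procedures described in the text.
        interface Oracle {
            List<String> subsequences(String p);      // all (|p|-1)-subsequences of p
            boolean frequentAndContinuous(String p);  // scan of TD with the Match procedure
        }

        // candidates.get(k) holds the candidate k-sequences produced while processing
        // the classes; index 2 is assumed to be already filtered (frequent, continuous).
        static List<Set<String>> filter(List<Set<String>> candidates, Oracle oracle) {
            List<Set<String>> confirmed = new ArrayList<>(candidates);
            Set<String> previous = confirmed.get(2);                  // seed set L2
            for (int k = 3; k < candidates.size(); k++) {
                Set<String> level = new HashSet<>();
                for (String p : candidates.get(k)) {
                    boolean keep = true;
                    for (String sub : oracle.subsequences(p)) {
                        if (!previous.contains(sub)) { keep = false; break; }  // prune p
                    }
                    if (keep && oracle.frequentAndContinuous(p)) level.add(p); // SupportCount
                }
                confirmed.set(k, level);
                previous = level;                                              // L <- Lk
            }
            return confirmed;
        }
    }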

7. Filtering the non-relevant sequences. As already mentioned, even after applying the structure-based and support-based pruning techniques (see chapter 2.3.1) during the processing of the classes, we may still obtain non-frequent sequences. Before checking the support and continuity of the discovered patterns by scanning the customer-sequences, we save time by pruning potentially non-appropriate patterns. We use the pruning technique described in chapter 4.3. Clearly, we cannot filter the patterns of a class until we have finished with all the classes, because the subsequences of a pattern may belong to different classes. To start the checking, we filter F2 by scanning TD, thereby forming the seed set for the initial iteration of the Pruning procedure. Then we proceed from the next set L3 to the last one and, at each iteration, filter the patterns of the current set Li by generating the appropriate subsequences of each pattern (see the Pruning subchapter of chapter 4.3) and checking whether they belong to the previous set Li−1. If a pattern is not pruned, we scan TD and check its frequency and continuity using the procedure Match described earlier. The remaining patterns form the seed set for the next iteration. This is shown in Algorithm 22.

  cid   st time   end time   b items   target
  1     10        14         CD        U
  1     15        20         ABC       U
  1     21        24         ABF       U
  1     25        31         ACDF      U
  2     15        19         ABF       U
  2     20        25         E         U
  3     10        15         ABF       U
  4     10        15         DGH       U
  4     20        24         BF        U
  4     25        30         AGH       U

Figure 4.11: Original Input-Sequence Database

4.4.1 Example

Consider the input database shown in Figure 4.11. The database has 8 basic events, 4 customers (identified by a cid), and 10 events in all. Figure 4.12 shows all the frequent events with a minimum support of 40%, together with their id-lists. The support of A is 60%, and that of B, F, and U is 50%. Figure 4.13 depicts the horizontal format of the events. All frequent pairs and 3-sequences, and the lattice they induce, are shown in Figure 4.14. The equivalence classes induced by θ1 are shown in Figure 4.15.

  A:  (1,15) (1,21) (1,25) (2,15) (3,10) (4,25)
  B:  (1,15) (1,21) (2,15) (3,10) (4,20)
  F:  (1,21) (1,25) (2,15) (3,10) (4,20)
  U:  (1,20) (1,31) (2,25) (3,15) (4,30)
  U:  (1,14) (1,24) (2,19) (4,15) (4,24)

Figure 4.12: Id-lists for the Atoms (each entry is a (CID, TID) pair)

  cid   (item, tid) pairs
  1     (A 15) (A 21) (A 25) (B 15) (B 21) (F 21) (F 25) (U 20) (U 31) (U 14) (U 24)
  2     (A 15) (B 15) (F 15) (U 25) (U 19)
  3     (A 10) (B 10) (F 10) (U 15)
  4     (A 25) (B 20) (F 20) (U 30) (U 15) (U 24)

Figure 4.13: Vertical-to-Horizontal Database Recovery
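For illustration, a minimal Java sketch of this vertical-to-horizontal recovery (names are hypothetical; the actual procedure follows [41]):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;

    class VerticalToHorizontal {

        record IdEntry(int cid, int tid) {}

        // Turns per-item id-lists (item -> list of (cid, tid) entries) into the
        // horizontal form of Figure 4.13: cid -> list of "(item tid)" pairs.
        static Map<Integer, List<String>> recover(Map<String, List<IdEntry>> idLists) {
            Map<Integer, List<String>> horizontal = new TreeMap<>();
            for (Map.Entry<String, List<IdEntry>> e : idLists.entrySet()) {
                for (IdEntry entry : e.getValue()) {
                    horizontal.computeIfAbsent(entry.cid(), c -> new ArrayList<>())
                              .add("(" + e.getKey() + " " + entry.tid() + ")");
                }
            }
            return horizontal;
        }
    }

With the id-lists of Figure 4.12 as input, this reproduces, up to ordering, the rows of Figure 4.13 (for example, cid 3 yields (A 10) (B 10) (F 10) (U 15)).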

Figure 4.14: Lattice Induced by Frequent 3-Sequences

Figure 4.15: Equivalence Classes induced by θ1

  dataset   |D|      C       T   E
  DB1       102100   11540   9   20
  DB2       102100   11540   9   11
  DB3       81700    11530   7   20

Figure 4.16: Tested Datasets Parameters

4.5 Performance evaluation and experiments

To compare the performance of our algorithms and demonstrate their scale-up properties, we performed several experiments on a Dual Xeon workstation with a CPU clock rate of 2.4 GHz and 2 GB of RAM, running Linux. The data is stored in Oracle 9i. The algorithms were implemented in Java and integrated with the FlexMine system [10].

4.5.1 The tested data

We evaluated the performance of the algorithms on relatively big real-life data (see 3.4.1). We vary parameters such as the number of customers in the dataset (C), the number of events per transaction (E), the average number of transactions per customer (T), and the total number of transactions (|D|). The values of these parameters for the three chosen datasets (DB1, DB2 and DB3, respectively) are shown in Figure 4.16. DB3 is an 80% sample of DB1. In this thesis we show the results of relative performance on DB1 and DB3.

Figure 4.17: Scale-up of CTSPD


Figure 4.18: Scale-up of CSPADE

Figure 4.19: Number of Generated Rules versus Minimum Support

4.5.2 Scale-up properties

In this chapter we present the results of the scale-up experiments for both of our algorithms. We tested scale-up both with respect to the number of customers and with respect to the number of events in a single transaction. Figures 4.17 and 4.18 show the scale-up experiments for both algorithms as the number of customers increases from 3,000 to 11,500. The results are shown for DB1. We keep all other dataset parameters constant and fix the minimum support at 20%, 35% and 50%, respectively. As shown, the execution time grows as the number of customers grows. CTSPD is faster, and its execution time scales more uniformly; the execution time of CSPADE increases sharply at 9,500 customers. We also ran experiments varying the number of events per transaction from 11 to 20. Both algorithms show similar scale-up behavior in both cases: the execution time grows with the number of events and the number of generated rules. Figure 4.19 presents experimental results showing that the number of generated rules (the same for both algorithms, of course) increases as the minimum support decreases.

Figure 4.20: Relative Performance: Minimum Support


Figure 4.21: Relative Performance: Number of Rules

4.5.3 Relative performance

Figure 4.20 presents the relative execution times of both algorithms on the three datasets defined in chapter 4.5.1. CSPADE is slower than CTSPD because it generates many candidates that are only filtered at the end of the process. The results are shown for the three datasets with the minimum support varying from 20% to 60%. As expected, the execution time increases as the minimum support decreases, due to the larger number of generated rules. We also present the relative execution times as a function of the actual number of generated rules in Figure 4.21. These results demonstrate that the execution time scales quite linearly.

Chapter 5

Conclusions

This research deals with the topic of temporal data mining. In this thesis a new algorithm for General Trend Dependencies was proposed and then extended to a variable number of snapshots. Both algorithms are based on the Apriori principle; the second one uses Apriori in a two-directional manner, for extending both the size and the dimension of the patternsets. We proved that the Apriori principle holds in both cases. The extended algorithm, MRTD, was implemented as part of the FlexMine system ([10]) and tested on real-life datasets with varying values of the input parameters.

In addition, a new problem definition was introduced for target rules mining, and two algorithms for solving it were proposed. Both algorithms, CTSPD and CSPADE, mine target sequential patterns, which have to be continuous, over a temporal database of customer transactions. The first algorithm, CTSPD, is based on the Apriori principle, adapted to our special application domain so that only continuous candidates are considered. The second algorithm, CSPADE, is based on SPADE, also adapted to our domain: we modify it so that equivalence classes that can never produce appropriate rules are not created during processing. However, non-continuous patterns may still be created; these are filtered at the end of the processing.


We presented relative-performance and scale-up experiments for both algorithms, using real-life data. As expected, CTSPD was faster than CSPADE, which generates many more candidates. Both algorithms have comparable scale-up properties.

The derived rules are used for improving the prediction of customers' product upgrades that was introduced in [12]. The prediction experiments based on our rules gave better results than any of those reported in [12]. The experiment was conducted in the following way. A dataset of about 8,000 customers was generated and partitioned into a training set (containing ≈ 30% of the population) and a test set (≈ 70%). Different feature sets were used as input: all the features, a subset chosen by feature selection, 3 randomly chosen features, and 3 features selected by a priori intuition. Abstraction (splitting the data into intervals) was also used. In order to use the sequential rules to predict upgrades, we used an algorithm that greedily chooses a subset of the rules until no further improvement in predictive ability is gained. The vast majority of the rules found during the tests had 'DID UPGRADE = No' on the right-hand side and were therefore not very interesting for us; nevertheless, we kept all the rules for the purpose of prediction. The predicted value for each instance in the test set was the value of the right-hand side of the rule. As a result of the experiment, the largest value of Errorness was 15%, of ErrornessUpgrade 51%, and of ErrornessNoUpgrade 15%, where Errorness is the percentage of the test set for which a mistaken prediction was made (incorrectly classified instances), ErrornessUpgrade is the percentage of the test-set customers that did upgrade but were predicted mistakenly, and ErrornessNoUpgrade is the percentage of those that did not upgrade but were predicted mistakenly.
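For concreteness, a small Java sketch of how the three error measures could be computed from actual and predicted upgrade labels (an illustration of the definitions above, not the thesis implementation; taking the class-conditional denominators to be the class sizes is an assumption):

    import java.util.List;

    class UpgradePredictionErrors {

        // Errorness is computed over the whole test set; the two class-conditional
        // measures are computed over the customers that did / did not upgrade.
        static void report(List<Boolean> actualUpgrade, List<Boolean> predictedUpgrade) {
            int wrong = 0, upgraders = 0, wrongUpgraders = 0, nonUpgraders = 0, wrongNonUpgraders = 0;
            for (int i = 0; i < actualUpgrade.size(); i++) {
                boolean actual = actualUpgrade.get(i);
                boolean mistaken = actual != predictedUpgrade.get(i);
                if (mistaken) wrong++;
                if (actual) { upgraders++; if (mistaken) wrongUpgraders++; }
                else { nonUpgraders++; if (mistaken) wrongNonUpgraders++; }
            }
            System.out.printf("Errorness          = %.1f%%%n", 100.0 * wrong / actualUpgrade.size());
            System.out.printf("ErrornessUpgrade   = %.1f%%%n", 100.0 * wrongUpgraders / Math.max(1, upgraders));
            System.out.printf("ErrornessNoUpgrade = %.1f%%%n", 100.0 * wrongNonUpgraders / Math.max(1, nonUpgraders));
        }
    }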

In the future, we would like to apply our sequential rules to other types of Network Mining; this can be useful in many research areas, such as predicting a new event in a customer's activity or detecting and preventing terrorist activity on the Internet. We would also like to introduce a matching measure as an additional satisfaction characteristic of the TD rules, for fuzzy rule mining. Finally, we found that the problem of extending the Clustered ARs to an n-sized LHS for n > 2 is an interesting geometric optimization problem, which can be investigated further.

Bibliography

[1] J. Adamo. Data Mining for Association Rules and Sequential Patterns. Springer-Verlag New York, 2001.

[2] C. C. Aggarwal, C. M. Procopiuc, and P. S. Yu. Finding localized associations in market basket data. Knowledge and Data Engineering, 14(1):51–62, 2002.

[3] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proc. of the 1993 ACM SIGMOD, pages 207–216, Washington, D.C., United States, 1993.

[4] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. of the 20th Int'l Conference on Very Large Databases, Santiago, Chile, September 1994.

[5] R. Agrawal and R. Srikant. Mining sequential patterns. In Proc. of the ICDE, Taipei, Taiwan, March 1995.

[6] J. M. Ale and G. H. Rossi. The itemset's lifespan approach to discovering general temporal association rules. In The Second Workshop on Temporal Data Mining, Edmonton, Alberta, Canada, July 2002.

[7] C. Antunes and A. L. Oliveira. Temporal data mining: an overview. In KDD 2001 Workshop on Temporal Data Mining, 7th ACM SIGKDD, San Francisco, CA, USA, August 2001.

[8] H. Arsham. Computational statistics with applications. http://home.ubalt.edu/ntsbarsh/Business-stat/stat-data/SPSSSAS.htm.

[9] T. Back. Evolutionary algorithms in theory and practice: evolution strategies, evolutionary programming, genetic algorithms. Oxford University Press, New York, 1996.


[10] R. Ben-Eliyahu-Zohary, C. Domshlak, E. Gudes, N. Liusternik, A. Meisels, T. Rosen, and S. E. Shimony. FlexiMine - a flexible platform for KDD research and application development. Annals of Mathematics and Artificial Intelligence, 39(1-2):175–204, 2003.

[11] S. Brin, R. Motwani, and C. Silverstein. Beyond market baskets: Generalizing association rules to correlations. In Proc. ACM SIGMOD, pages 265–276, Tucson, Arizona, USA, May 1997.

[12] A. Budker, E. Gudes, T. Hildeshaim, M. Litvak, E. Shimony, L. Amit, S. Meltzin, and G. Solotorevsky. Targeting customers by mining usage time-series. In CS/Stat'03 Second Haifa Winter Workshop on Computer Science and Statistics, Haifa, Israel, December 2003.

[13] D. R. Greening. Data mining on the web. Web Techniques, 2000.

[14] O. Etzion, S. Jajodia, and S. Sripada, editors. Temporal Databases: Research and Practice, volume 1399 of Lecture Notes in Computer Science. Springer, 1998.

[15] P. Fabris. Cover story: Data mining. CIO, May 15, 1998.

[16] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. Advances in Knowledge Discovery and Data Mining, chapter From Data Mining to Knowledge Discovery: An Overview, pages 1–34. AAAI Press / The MIT Press, Menlo Park, CA, 1996.

[17] G. M. Weiss. Mining predictive patterns in sequences of events. In Proc. of AAAI/GECCO Workshop on Data Mining with Evolutionary Algorithms: Research Directions, 1999.

[18] M. Gavrilov, D. Anguelov, P. Indyk, and R. Motwani. Mining the stock market: Which measure is best? (extended abstract).

[19] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Academic Press, 2001.

[20] K. P. Joshi. Analysis of data mining algorithms. http://userpages.umbc.edu/~kjoshi1/datamine/proj rpt.htm, 1997.

[21] J. Wijsen and R. Meersman. On the complexity of mining temporal trends. In The ACM SIGMOD Int. Workshop on Research Issues on Data Mining and Knowledge Discovery, pages 77–84, 1997.


[22] R. Kohavi, D. Sommerfield, and J. Dougherty. Data mining using MLC++: A machine learning library in C++. In Tools with AI, 1996.

[23] B. Lent, A. Swami, and J. Widom. Clustering association rules. In ICDE '97, 1997.

[24] O. P. Ltd. The OLAP report. http://www.olapreport.com/, 2005.

[25] H. Mannila, H. Toivonen, and A. Inkeri Verkamo. Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery, 1(3):259–289, 1997.

[26] M. H. Dunham. Data Mining: Introductory and Advanced Topics. Prentice Hall, 2003.

[27] A. Moore and M. S. Lee. Cached sufficient statistics for efficient machine learning with large datasets. Journal of Artificial Intelligence Research, 8:67–91, March 1998.

[28] R. Parekh, J. Yang, and V. Honavar. Constructive neural-network learning algorithms for pattern classification. IEEE Transactions on Neural Networks, 11(2), March 2000.

[29] T. R. Payne, P. Edwards, and C. L. Green. Experience with rule induction and k-nearest neighbor methods for interface agents that learn. IEEE Transactions on Knowledge and Data Engineering, 9(2), March–April 1997.

[30] P. Kam and A. Wai-chee Fu. Discovering temporal patterns for interval-based events. In Proc. of DaWaK 2000, pages 317–332, London, UK, September 2000.

[31] O. R. Zaiane. Data abstraction. http://www.cs.sfu.ca/CC/354/zaiane/material/notes/Chapter1/node4.htm.

[32] J. Smith. Using data mining for plant maintenance. Plant Engineering, December 2001, 2002.

[33] R. Srikant and R. Agrawal. Mining quantitative association rules in large relational tables. In Proc. of the ACM SIGMOD, 1996.

[34] R. Srikant and R. Agrawal. Mining sequential patterns: generalizations and performance improvements. In Proc. of the Fifth Int'l Conference on Extending Database Technology, Avignon, France, March 1996.

[35] K. Thearling, B. Becker, D. DeCoste, B. Mawby, M. Pilote, and D. Sommerfield. Information Visualization in Data Mining and Knowledge Discovery, chapter 15. Morgan Kaufman, 2001.

[36] J. T.-L. Wang, G. Chirn, T. Marr, B. Shapiro, D. Shasha, and K. Zhang. Combinatorial pattern discovery for scientific data: Some preliminary results. In Proc. of the 1994 ACM SIGMOD, pages 115–125, Minneapolis, Minnesota, US, May 1994.


[37] K. Wang and J. Tan. Incremental discovery of sequential patterns. In Proc. of Workshop on Research Issues on Data Mining in cooperation with ACM-SIGMOD'96, Montreal, Canada, June 1996.

[38] X. S. Wang, S. Jajodia, Y. Li, and P. Ning. Discovering calendar-based temporal association rules. In Eighth International Symposium on Temporal Representation and Reasoning, TIME-01, pages 111–118, Cividale del Friuli, Italy, June 2001.

[39] X. Chen and I. Petrounias. An integrated query and mining system for temporal association rules. In Proc. of DaWaK 2000, pages 327–337, London, UK, September 2000.

[40] Y. Aumann and Y. Lindell. A statistical theory for quantitative association rules. Journal of Intelligent Information Systems, 20(3):255–283, May 2003.

[41] M. Zaki. SPADE: An efficient algorithm for mining frequent sequences. Machine Learning Journal, 42(1-2):31–60, January/February 2001. Special issue on Unsupervised Learning (Doug Fisher, ed.).

[42] G. Zimbrao, J. M. de Souza, V. T. de Almeida, and W. A. da Silva. An algorithm to discover calendar-based temporal association rules with item's lifespan restriction. In The Second Workshop on Temporal Data Mining, Edmonton, Alberta, Canada, July 2002.