Learning Decision Trees Adaptively from Data Streams with Time Drift

Albert Bifet and Ricard Gavaldà
LARCA: Laboratori d’Algorísmica Relacional, Complexitat i Aprenentatge
Departament de Llenguatges i Sistemes Informàtics
Universitat Politècnica de Catalunya

September 2007

Outline

Introduction
ADWIN-DT Decision Tree
Experiments
Conclusions

Introduction: Data Streams

Data Streams
The sequence is potentially infinite.
High amount of data: sublinear space.
High speed of arrival: sublinear time per example.
Once an element from a data stream has been processed, it is discarded or archived.

Example Puzzle: Finding Missing Numbers
Let π be a permutation of {1, ..., n}, and let π−1 be π with one element missing.
The elements π−1[i] arrive in increasing order.
Task: determine the missing number.

Naive solution: use an n-bit vector to memorize all the numbers seen (O(n) space).

Data stream solution (O(log n) space): store the running sum of the elements seen so far and output

    n(n + 1)/2 − Σ_{j≤i} π−1[j].
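A minimal sketch of this streaming solution in Python (the stream is simulated by an iterator; the function name is illustrative):

    def missing_number(stream, n):
        """Find the missing element of a permutation of {1, ..., n} seen as a
        stream: store only the running sum (O(log n) bits, O(1) time per item)."""
        running_sum = 0
        for x in stream:      # each element is processed once, then discarded
            running_sum += x
        return n * (n + 1) // 2 - running_sum

    # Example: permutation of {1, ..., 6} with the element 4 missing.
    print(missing_number(iter([1, 2, 3, 5, 6]), n=6))   # -> 4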


Data Streams

At any time t in the data stream, we would like the per-item processing time and storage to be simultaneously O(log^k(N, t)).

Approximation algorithms: small error rate with high probability. An algorithm (ε, δ)-approximates F if it outputs F̃ for which Pr[|F̃ − F| > εF] < δ.


Data Streams: Approximation Algorithms

Frequency moments of a stream A = {a1, ..., aN}:

    F_k = Σ_{i=1}^{v} f_i^k

where f_i is the frequency of i in the sequence, and k ≥ 0.

F0: number of distinct elements in the sequence
F1: length of the sequence
F2: self-join size, also known as the repeat rate or Gini's index of homogeneity

Sketches can approximate F0, F1, F2 in O(log v + log N) space.

Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. 1996.
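To illustrate the sketching idea, here is a simplified AMS-style estimator for F2. It is a sketch of the idea only: the real algorithm draws its random signs from a 4-wise independent hash family with a small seed, whereas this demo stores a sign per distinct item.

    import random

    def ams_f2_estimate(stream, copies=30, seed=1):
        """Simplified AMS-style estimator for F2 = sum_i f_i^2.
        Each copy maintains Z = sum_i f_i * s(i) for random signs s(i) in {-1, +1};
        E[Z^2] = F2, and averaging independent copies reduces the variance."""
        rngs = [random.Random(seed * 1000003 + c) for c in range(copies)]
        signs = [{} for _ in range(copies)]      # lazily drawn sign per item (demo only)
        z = [0] * copies
        for a in stream:
            for c in range(copies):
                s = signs[c].setdefault(a, rngs[c].choice((-1, 1)))
                z[c] += s
        return sum(zc * zc for zc in z) / copies

    print(ams_f2_estimate([1, 1, 2, 3, 3, 3]))   # true F2 = 2^2 + 1^2 + 3^2 = 14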


Classification

Example: a data set that describes e-mail features for deciding whether a message is spam.

    Contains "Money"   Domain type   Has attach.   Time received   spam
    yes                com           yes           night           yes
    yes                edu           no            night           yes
    no                 com           yes           night           yes
    no                 edu           no            day             no
    no                 com           no            day             no
    yes                cat           no            day             yes

Assume we have to classify the following new instance:

    Contains "Money"   Domain type   Has attach.   Time received   spam
    yes                edu           yes           day             ?



Decision Trees

Basic induction strategy (sketched in code below):
1 A ← the "best" decision attribute for the next node
2 Assign A as the decision attribute for the node
3 For each value of A, create a new descendant of the node
4 Sort the training examples to the leaf nodes
5 If the training examples are perfectly classified, then STOP; else iterate over the new leaf nodes
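A minimal, batch ID3-style sketch of this greedy induction in Python, using the spam table above (attribute names are shortened; information gain as the "best" criterion is one standard choice):

    import math
    from collections import Counter

    DATA = [  # the spam table from the Classification slide
        {"money": "yes", "domain": "com", "attach": "yes", "time": "night", "spam": "yes"},
        {"money": "yes", "domain": "edu", "attach": "no",  "time": "night", "spam": "yes"},
        {"money": "no",  "domain": "com", "attach": "yes", "time": "night", "spam": "yes"},
        {"money": "no",  "domain": "edu", "attach": "no",  "time": "day",   "spam": "no"},
        {"money": "no",  "domain": "com", "attach": "no",  "time": "day",   "spam": "no"},
        {"money": "yes", "domain": "cat", "attach": "no",  "time": "day",   "spam": "yes"},
    ]

    def entropy(labels):
        n = len(labels)
        return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

    def info_gain(rows, attr):
        labels = [r["spam"] for r in rows]
        rem = sum(len(p) / len(rows) * entropy(p)
                  for v in {r[attr] for r in rows}
                  for p in [[r["spam"] for r in rows if r[attr] == v]])
        return entropy(labels) - rem

    def build_tree(rows, attrs):
        labels = [r["spam"] for r in rows]
        if len(set(labels)) == 1 or not attrs:     # pure node, or no attributes left
            return Counter(labels).most_common(1)[0][0]
        a = max(attrs, key=lambda attr: info_gain(rows, attr))   # the "best" attribute
        rest = [x for x in attrs if x != a]
        return (a, {v: build_tree([r for r in rows if r[a] == v], rest)
                    for v in {r[a] for r in rows}})

    def classify(tree, instance):
        while isinstance(tree, tuple):
            attr, children = tree
            tree = children.get(instance[attr], "?")   # unseen value: no prediction
        return tree

    tree = build_tree(DATA, ["money", "domain", "attach", "time"])
    # The new instance from the slide; every training message containing "Money" is spam.
    print(classify(tree, {"money": "yes", "domain": "edu", "attach": "yes", "time": "day"}))  # -> yes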


VFDT / CVFDT

Very Fast Decision Tree: VFDT
Pedro Domingos and Geoff Hulten. Mining high-speed data streams. 2000.
With high probability, it constructs a model identical to the one a traditional (greedy) method would learn.
It comes with theoretical guarantees on the error rate.
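The tool behind that guarantee is the Hoeffding bound. A small illustration of the split test (the gain values, n, and δ below are made-up numbers, not taken from the paper):

    import math

    def hoeffding_bound(value_range, delta, n):
        """With probability at least 1 - delta, the true mean of a variable with
        range `value_range` is within this epsilon of its mean over n samples."""
        return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

    # VFDT-style decision: split when the gap between the two best attributes'
    # observed information gains exceeds epsilon.
    g_best, g_second, n = 0.45, 0.30, 500                    # illustrative numbers
    eps = hoeffding_bound(value_range=1.0, delta=1e-6, n=n)  # gain range is log2(2) = 1 for 2 classes
    if g_best - g_second > eps:
        print("enough evidence to split on the best attribute")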


VFDT / CVFDT

Concept-adapting Very Fast Decision Trees: CVFDT
G. Hulten, L. Spencer, and P. Domingos. Mining time-changing data streams. 2001.
It keeps its model consistent with a sliding window of examples.
It constructs "alternative branches" in preparation for changes.
If an alternative branch becomes more accurate, the tree switches to that branch.


Decision Trees: CVFDT

There are no theoretical guarantees on the error rate of CVFDT.

CVFDT parameters:
1 W: the example window size
2 T0: number of examples used to check at each node whether the splitting attribute is still the best
3 T1: number of examples used to build the alternate tree
4 T2: number of examples used to test the accuracy of the alternate tree


Decision Trees: ADWIN-DT

The ADWIN-DT improvements are:
replace frequency-statistics counters by estimators
no window of stored examples is needed, since the required statistics are maintained by the estimators
change the criterion for substituting alternate subtrees, using a change detector with theoretical guarantees

Summary:
1 Theoretical guarantees
2 No parameters


Time Change Detectors and Predictors: A General Framework

[Diagram: the framework is built from three modules, introduced incrementally. Each input item x_t feeds an Estimator, which outputs the current Estimation. A Change Detector monitors the estimator's output and raises an Alarm when change is detected. A Memory module, connected to both, stores summarized data used to improve the estimation.]


Window Management Models

W = 101010110111111

Equal & fixed-size subwindows [Kifer+ 04]:
    1010 1011011 1111

Equal-size adjacent subwindows [Dasu+ 06]:
    1010101 1011

Total window against subwindow [Gama+ 04]:
    10101011011 1111

ADWIN: all adjacent subwindows. ADWIN considers every split of the window into two adjacent subwindows:
    1 01010110111111,  10 1010110111111,  101 010110111111,  ...,  10101011011 1111,  ...,  10101011011111 1


Algorithm ADWIN

ADWIN: ADAPTIVE WINDOWING ALGORITHM
1 Initialize window W
2 for each t > 0
3   do W ← W ∪ {x_t} (i.e., add x_t to the head of W)
4     repeat drop elements from the tail of W
5     until |μ̂_W0 − μ̂_W1| < ε_c holds
6       for every split of W into W = W0 · W1
7   output μ̂_W

Example: W = 101010110111111. ADWIN checks every split of W into W = W0 · W1:

    W0 = 1           W1 = 01010110111111
    W0 = 10          W1 = 1010110111111
    W0 = 101         W1 = 010110111111
    ...
    W0 = 10101011    W1 = 0111111
    W0 = 101010110   W1 = 111111     |μ̂_W0 − μ̂_W1| ≥ ε_c : CHANGE DETECTED!

When a split signals change, elements are dropped from the tail of W:

    W = 01010110111111
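A runnable, naive rendering of this pseudocode in Python. This version keeps the window explicitly and re-checks every split, so it is O(W) per step; the real algorithm compresses the window (see ADWIN2 below). The cut threshold is a Hoeffding-style choice in the spirit of the paper's ε_c, not quoted from it:

    import math
    from collections import deque

    class AdwinZero:
        """Naive ADWIN: explicit window, every split W = W0 . W1 is checked."""

        def __init__(self, delta=0.1):
            self.delta = delta
            self.window = deque()    # leftmost = tail (oldest), rightmost = head (newest)

        def _cut(self, n0, n1):
            # Hoeffding-style threshold for subwindow sizes n0, n1 (assumed form).
            m = 1.0 / (1.0 / n0 + 1.0 / n1)
            dp = self.delta / max(1, len(self.window))
            return math.sqrt(math.log(4.0 / dp) / (2.0 * m))

        def _some_split_cuts(self):
            w = list(self.window)
            total, s0 = sum(w), 0.0
            for i in range(1, len(w)):             # every split point W = W0 . W1
                s0 += w[i - 1]
                mu0, mu1 = s0 / i, (total - s0) / (len(w) - i)
                if abs(mu0 - mu1) >= self._cut(i, len(w) - i):
                    return True
            return False

        def update(self, x):
            """Add x in [0, 1]; shrink the window while change is detected.
            Returns True if any element was dropped at this step."""
            self.window.append(x)
            changed = False
            while len(self.window) > 1 and self._some_split_cuts():
                self.window.popleft()              # drop from the tail
                changed = True
            return changed

        @property
        def estimate(self):
            return sum(self.window) / max(1, len(self.window))

    # On a stream this short, detection may or may not fire for a given delta;
    # ADWIN's guarantees are statistical, over longer streams.
    adw = AdwinZero(delta=0.1)
    for bit in map(int, "101010110111111"):
        if adw.update(bit):
            print("change detected, window now:", "".join(str(int(b)) for b in adw.window))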


Algorithm ADWIN [BG07]

ADWIN has rigorous guarantees (theorems):
on the rate of false positives
on the rate of false negatives
on the relation between the size of the current window and the change rates

Other methods in the literature ([Gama+ 04], [Widmer+ 96], [Last 02]) do not provide rigorous guarantees.


Data Streams Algorithm ADWIN2 [BG07]

ADWIN2, using a data-stream sliding-window model:
can provide the exact counts of 1's in O(1) time per point
tries O(log W) cutpoints
uses O((1/ε) log W) memory words
has O(log W) processing time per example (amortized and worst-case)

It offers essentially the same guarantees as ADWIN, up to a multiplicative O(·) factor depending on ε.
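A hedged sketch of the bucket compression behind this memory bound, in the style of the exponential histograms of Datar et al.; `max_per_size` and the merge rule below are the standard EH invariant, simplified (window expiry, which ADWIN drives by its cuts, is not modeled):

    class ExpHistogram:
        """Counts 1's in a stream with O(log W) buckets instead of O(W) bits.
        Buckets hold power-of-two counts, oldest first; at most `max_per_size`
        buckets of each size, enforced by merging the two oldest of a size."""

        def __init__(self, max_per_size=2):
            self.max_per_size = max_per_size
            self.buckets = []                # bucket sizes, oldest -> newest (non-increasing)

        def add(self, bit):
            if bit == 1:
                self.buckets.append(1)
                size = 1
                while self.buckets.count(size) > self.max_per_size:
                    i = self.buckets.index(size)       # two oldest of this size are adjacent
                    self.buckets[i:i + 2] = [2 * size]
                    size *= 2

        def count(self):
            """Total of 1's inserted (with a bounded window, the oldest bucket
            would be partially expired and typically counted at half weight)."""
            return sum(self.buckets)

    eh = ExpHistogram()
    for bit in map(int, "101010110111111"):
        eh.add(bit)
    print(len(eh.buckets), "buckets for", eh.count(), "ones")   # 5 buckets for 10 ones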


Decision Trees: ADWIN-DT

The ADWIN-DT improvements are:
replace frequency-statistics counters by ADWIN
no window of stored examples is needed, since the required statistics are maintained by ADWINs
change the criterion for substituting alternate subtrees, using ADWIN as the change detector (see the sketch below)

Summary:
1 Theoretical guarantees
2 No parameters needed
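A hedged sketch of the alternate-subtree decision, reusing the AdwinZero class from the algorithm section; the node fields and the swap criterion shown are illustrative of the approach, not the paper's exact code:

    class NodeWithAlternate:
        """Each internal node tracks the error of its subtree and of an alternate
        subtree with one ADWIN instance each; no W, T0, T1, T2 parameters."""

        def __init__(self, delta=0.01):
            self.err_main = AdwinZero(delta)   # 1 = misclassified, 0 = correct
            self.err_alt = AdwinZero(delta)
            self.alternate = None              # alternate subtree, grown on drift

        def observe(self, main_wrong, alt_wrong):
            self.err_main.update(1.0 if main_wrong else 0.0)
            if self.alternate is not None:
                self.err_alt.update(1.0 if alt_wrong else 0.0)
                # swap when the alternate's estimated error is lower
                if self.err_alt.estimate < self.err_main.estimate:
                    self.swap_in_alternate()

        def swap_in_alternate(self):
            # Here the alternate subtree would replace the current one.
            print("alternate subtree promoted")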


Experiments

[Figure: Learning curve of SEA Concepts using continuous attributes — error rate (%) of ADWIN-DT and CVFDT (y-axis, roughly 10-24%) over the number of examples processed (x-axis).]


Experiments

[Figure: Memory (MB) used on SEA Concepts experiments, for ADWIN-DT (Det, Est, Det+Est) and CVFDT with window sizes w = 1,000, 10,000, and 100,000; values range up to about 3.5 MB.]


Experiments

[Figure: Time (sec) on SEA Concepts experiments, for ADWIN-DT (Det, Est, Det+Est) and CVFDT with window sizes w = 1,000, 10,000, and 100,000; values range up to about 200 seconds.]


Experiments

[Figure: On-line error of CVFDT and ADWIN-DT on the UCI Adult dataset, ordered by the education attribute, as a function of the CVFDT window width (1,000 to 25,000); on-line error ranges between roughly 10% and 22%.]


Conclusions

The ADWIN-DT improvements are:
replace frequency-statistics counters by ADWIN
no window of stored examples is needed, since the required statistics are maintained by ADWINs
change the criterion for substituting alternate subtrees, using ADWIN as the change detector

Summary:
1 Theoretical guarantees
2 No parameters needed
3 Higher accuracy
4 Less space needed