Learning Decision Trees Adaptively from Data Streams with Time Drift. Albert Bifet and Ricard Gavaldà.
Introduction
ADWIN-DT Decision Tree
Experiments
Conclusions
Learning Decision Trees Adaptively from Data Streams with Time Drift Albert Bifet and Ricard Gavaldà LARCA: Laboratori d’Algorísmica Relacional, Complexitat i Aprenentatge Departament de Llenguatges i Sistemes Informàtics Universitat Politècnica de Catalunya
September 2007
Introduction: Data Streams

Data Streams:
- Sequence is potentially infinite
- High amount of data: sublinear space
- High speed of arrival: sublinear time per example
- Once an element from a data stream has been processed, it is discarded or archived

Example Puzzle: Finding Missing Numbers
Let π be a permutation of {1, . . . , n}.
Let π−1 be π with one element missing.
π−1[i] arrives in increasing order.
Task: Determine the missing number.
Naive solution: use an n-bit vector to memorize all the numbers (O(n) space).
Data Streams solution, in O(log n) space: store

    n(n + 1)/2 − Σ_{j≤i} π−1[j]

and the missing number is the value left when the stream ends.
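The running-difference solution above fits in a few lines; a minimal sketch (illustrative, not code from the talk):

```python
def missing_number(stream, n):
    """Finds the missing element of a permutation of {1, ..., n} with one
    element removed, keeping only a single O(log n)-bit counter."""
    total = n * (n + 1) // 2      # sum of the full permutation
    for x in stream:              # one pass; each element is then discarded
        total -= x
    return total
```

Note that the elements need not even arrive in increasing order for this to work.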
Data Streams

At any time t in the data stream, we would like the per-item processing time and storage to be simultaneously O(log^k(N, t)).

Approximation algorithms: small error rate with high probability. An algorithm (ε, δ)-approximates F if it outputs F̃ for which Pr[|F̃ − F| > εF] < δ.
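Sampling gives a concrete (ε, δ)-approximation scheme. The sketch below uses Hoeffding's inequality for an additive-error variant of the definition (the function name and interface are illustrative assumptions, not from the talk):

```python
import math

def approx_mean(draw, eps=0.05, delta=0.01):
    """(eps, delta)-approximates the mean F of a [0, 1] random variable:
    averaging m >= ln(2/delta) / (2 * eps**2) independent samples gives
    Pr[|F~ - F| > eps] < delta by Hoeffding's inequality (additive error).
    `draw` is any callable returning one sample in [0, 1]."""
    m = math.ceil(math.log(2.0 / delta) / (2 * eps ** 2))
    return sum(draw() for _ in range(m)) / m
```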
Data Streams Approximation Algorithms: Frequency Moments

Frequency moments of a stream A = {a1, . . . , aN}:

    F_k = Σ_{i=1}^{v} f_i^k

where f_i is the frequency of i in the sequence, and k ≥ 0:
- F0: number of distinct elements in the sequence
- F1: length of the sequence
- F2: self-join size, the repeat rate, or Gini's index of homogeneity

Sketches can approximate F0, F1, F2 in O(log v + log N) space.

Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. 1996.
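For reference, the exact moments are easy to compute when space is not a concern; an illustrative sketch (not from the talk):

```python
from collections import Counter

def frequency_moment(stream, k):
    """Exact k-th frequency moment F_k = sum_i f_i^k. This uses a full
    Counter, i.e. O(v) space; the sketches cited above approximate
    F_0, F_1, F_2 in O(log v + log N) space instead."""
    freq = Counter(stream)
    if k == 0:
        return len(freq)          # F_0: number of distinct elements
    return sum(f ** k for f in freq.values())
```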
Classification Example

Data set that describes e-mail features for deciding if it is spam:

Contains "Money"  Domain type  Has attach.  Time received  spam
yes               com          yes          night          yes
yes               edu          no           night          yes
no                com          yes          night          yes
no                edu          no           day            no
no                com          no           day            no
yes               cat          no           day            yes

Assume we have to classify the following new instance:

Contains "Money"  Domain type  Has attach.  Time received  spam
yes               edu          yes          day            ?
Decision Trees

Basic induction strategy:
1. A ← the "best" decision attribute for the next node
2. Assign A as the decision attribute for the node
3. For each value of A, create a new descendant of the node
4. Sort training examples to the leaf nodes
5. If the training examples are perfectly classified, then STOP; else iterate over the new leaf nodes
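Step 1, picking the "best" attribute, is usually done by information gain. A minimal sketch (an illustrative assumption: ID3-style gain, not necessarily the criterion used later in the talk):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_attribute(rows, labels):
    """Returns the column index with the highest information gain --
    the 'best decision attribute' of the induction strategy above.
    `rows` is a list of equal-length attribute tuples."""
    base = entropy(labels)
    def gain(a):
        split = Counter(r[a] for r in rows)
        remainder = 0.0
        for value, count in split.items():
            sub = [lab for r, lab in zip(rows, labels) if r[a] == value]
            remainder += (count / len(rows)) * entropy(sub)
        return base - remainder
    return max(range(len(rows[0])), key=gain)
```

On the spam table above, this picks the Contains "Money" column (tied with Time received; ties resolve to the first column).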
VFDT / CVFDT

Very Fast Decision Tree: VFDT
Pedro Domingos and Geoff Hulten. Mining high-speed data streams. 2000.
- With high probability, constructs a model identical to the one a traditional (greedy) batch method would learn
- With theoretical guarantees on the error rate
VFDT / CVFDT

Concept-adapting Very Fast Decision Trees: CVFDT
G. Hulten, L. Spencer, and P. Domingos. Mining time-changing data streams. 2001.
- Keeps its model consistent with a sliding window of examples
- Constructs "alternative branches" as preparation for changes
- If an alternative branch becomes more accurate, the tree branches are switched
Decision Trees: CVFDT

No theoretical guarantees on the error rate of CVFDT.

CVFDT parameters:
1. W: the example window size
2. T0: number of examples used to check at each node whether the splitting attribute is still the best
3. T1: number of examples used to build the alternate tree
4. T2: number of examples used to test the accuracy of the alternate tree
Decision Trees: ADWIN-DT

ADWIN-DT improvements:
- replaces frequency-statistics counters by estimators
- needs no window of stored examples, since the required statistics are maintained by the estimators
- changes the way substitution of alternate subtrees is checked, using a change detector with theoretical guarantees

Summary:
1. Theoretical guarantees
2. No parameters
Time Change Detectors and Predictors: A General Framework

The input x_t feeds an Estimator, which outputs the Estimation. A Change Detector monitors the estimator and raises an Alarm when a change is detected, and a Memory module interacts with both, storing the data the estimator needs.
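A minimal instance of this framework, sketched in Python (class and parameter names are illustrative assumptions; the Memory here is just the estimator's running value):

```python
class Estimator:
    """Exponentially weighted moving-average estimator: its single
    running value plays the role of the Memory module."""
    def __init__(self, alpha=0.1):
        self.alpha = alpha
        self.estimation = None
    def update(self, xt):
        if self.estimation is None:
            self.estimation = xt
        else:
            # move the estimate a fraction alpha toward the new input
            self.estimation += self.alpha * (xt - self.estimation)
        return self.estimation

class ChangeDetector:
    """Raises an alarm when the new input deviates too far from the
    current estimation (a fixed threshold, unlike ADWIN's adaptive one)."""
    def __init__(self, threshold=0.5):
        self.threshold = threshold
    def alarm(self, xt, estimation):
        return abs(xt - estimation) > self.threshold
```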
Window Management Models

W = 101010110111111

- Equal & fixed-size subwindows: 1010 1011011 1111 [Kifer+ 04]
- Equal-size adjacent subwindows: 1010101 1011 [Dasu+ 06]
- Total window against subwindow: 10101011011 1111 [Gama+ 04]
- ADWIN: all adjacent subwindows are compared, e.g. 1 | 01010110111111, 10 | 1010110111111, . . . , 10101011011 | 1111
Algorithm ADWIN

ADWIN: ADAPTIVE WINDOWING ALGORITHM
1 Initialize window W
2 for each t > 0
3   do W ← W ∪ {x_t} (i.e., add x_t to the head of W)
4   repeat drop elements from the tail of W
5   until |μ̂_W0 − μ̂_W1| < ε_cut holds
6     for every split of W into W = W0 · W1
7   Output μ̂_W

Example: W = 101010110111111. ADWIN examines every split W = W0 · W1 (W0 = 1, W0 = 10, W0 = 101, . . .). At the split W0 = 101010110, W1 = 111111 we find |μ̂_W0 − μ̂_W1| ≥ ε_cut: CHANGE DETECTED, and elements are dropped from the tail of W until no split exceeds the threshold.
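The loop above can be implemented naively by keeping the window as an explicit list; this is an illustrative sketch (the ε_cut formula follows the Hoeffding-style bound of [BG07] with δ shared across the possible splits), not the bucket-based production version:

```python
import math

def _change(W, delta):
    """True if some split W = W0 . W1 has |mean(W0) - mean(W1)| >= eps_cut."""
    n = len(W)
    for i in range(1, n):                    # every split point
        n0, n1 = i, n - i
        mu0 = sum(W[:i]) / n0
        mu1 = sum(W[i:]) / n1
        m = 1.0 / (1.0 / n0 + 1.0 / n1)      # harmonic mean of |W0|, |W1|
        eps = math.sqrt((1.0 / (2 * m)) * math.log(4.0 * n / delta))
        if abs(mu0 - mu1) >= eps:
            return True
    return False

def adwin0(stream, delta=0.01):
    """Naive ADWIN: add each x_t to the head of W, then drop elements
    from the tail while some split still witnesses a change; yield the
    current window mean after each input."""
    W = []
    for x in stream:
        W.append(x)                          # head of W is the newest element
        while len(W) >= 2 and _change(W, delta):
            W.pop(0)                         # drop from the tail (oldest)
        yield sum(W) / len(W)
```

Each step here costs O(|W|) per split check, i.e. O(|W|^2) per item; the bucket-based ADWIN2 below reduces this to O(log W).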
Algorithm ADWIN [BG07]

ADWIN has rigorous guarantees (theorems):
- on the rate of false positives
- on the rate of false negatives
- on the relation between the size of the current window and the change rates

Other methods in the literature ([Gama+ 04], [Widmer+ 96], [Last 02]) don't provide rigorous guarantees.
Data Streams: Algorithm ADWIN2 [BG07]

ADWIN2, using a data-stream sliding-window model:
- can provide the exact counts of 1's in O(1) time per point
- tries O(log W) cutpoints
- uses O((1/ε) log W) memory words
- has processing time per example of O(log W) (amortized and worst-case)

Essentially the same guarantees as ADWIN (up to a multiplicative O(. . .) factor depending on ε).
Decision Trees: ADWIN-DT

ADWIN-DT improvements:
- replaces frequency-statistics counters by ADWIN
- needs no window of stored examples, since the required statistics are maintained by ADWINs
- changes the way substitution of alternate subtrees is checked, using ADWIN as the change detector

Summary:
1. Theoretical guarantees
2. No parameters needed
Experiments

Figure: Learning curve (error rate, %) of ADWIN-DT vs. CVFDT on SEA Concepts using continuous attributes.
Experiments

Figure: Memory (MB) used on the SEA Concepts experiments: ADWIN-DT Det, CVFDT w=1,000, CVFDT w=10,000, CVFDT w=100,000, ADWIN-DT Est, ADWIN-DT Det+Est.
Experiments

Figure: Time (sec) on the SEA Concepts experiments: ADWIN-DT Det, CVFDT w=1,000, CVFDT w=10,000, CVFDT w=100,000, ADWIN-DT Est, ADWIN-DT Det+Est.
Experiments

Figure: On-line error (%) of CVFDT vs. ADWIN-DT on the UCI Adult dataset, ordered by the education attribute, plotted against CVFDT window width (1,000 to 25,000).
Conclusions

ADWIN-DT improvements:
- replaces frequency-statistics counters by ADWIN
- needs no window of stored examples, since the required statistics are maintained by ADWINs
- changes the way substitution of alternate subtrees is checked, using ADWIN as the change detector

Summary:
1. Theoretical guarantees
2. No parameters needed
3. Higher accuracy
4. Less space needed