Curiosity Driven Incremental LDA Agent Active Learning


Proceedings of International Joint Conference on Neural Networks, Atlanta, Georgia, USA, June 14-19, 2009

Shaoning Pang, Seiichi Ozawa, and Nik Kasabov

Abstract- This paper presents a novel active linear discriminant analysis (LDA) learning method in the form of curiosity-driven incremental LDA (cILDA) and multiple cILDA agents cooperative learning (mcILDA). Curiosity, in the psychological sense, is modelled here mathematically as a discriminability residue between the instance space and its corresponding eigenspace. As the learning proceeds, the curiosity of an individual agent is updated over time by two incremental learning processes: one updates the characterization of the eigenspace and the other re-calculates the curiosity. In the multi-agent scenario, individual agents communicate and cooperate with each other at every learning stage to discover the discriminant characterization of the whole pattern. In the experiments, we describe how the discriminative instances can be significantly selected based on the curiosity with, at most, minor sacrifices in learning rate and classification accuracy. The experimental results show that the proposed curiosity learning performs gracefully under different levels of redundancy, and that the proposed cILDA/mcILDA learning system is capable of learning fewer instances while more often achieving an improved discrimination performance.

Shaoning Pang and Nikola Kasabov are with the Knowledge Engineering & Discovery Research Institute, Auckland University of Technology, Private Bag 92006, Auckland 1020, New Zealand (email: {spang.nkasabov}@aut.ac.nz). Seiichi Ozawa is with the Graduate School of Engineering, Faculty of Engineering, Kobe University, Japan (email: [email protected]).

I. INTRODUCTION

Datasets acquired in real practice are vast, inaccurate, inconsistent and unlabelled. Labelling such data for purposes such as classification or knowledge discovery becomes a laborious, costly and time-consuming process which traditional machine learning algorithms are incapable of handling. This is where 'active learning' comes in handy. Active learning is a learning algorithm which aims to create an accurate classifier by iteratively selecting the essentially important unlabelled data points by means of adaptive querying, and training the classifiers on those data points which are potentially useful for the targeted learning task [1]. Considering supervised machine learning for class discrimination, Linear Discriminant Analysis (LDA) [9] seeks a transformation towards maximum separation between classes and minimum separation within classes. LDA assumes that the entire dataset for training is truly informative and is presented in advance. However, in real world applications, data is often presented at different times in a stream of random chunks, and the quality of the data is not guaranteed due to noise contamination.

Incremental LDA (ILDA) [10] has resolved this difficulty of LDA and empowered LDA with the flexibility of incremental learning, accommodating a data stream either one instance at a time, in steps of chunks of instances, or even with bursts of new class instances presented at different times. In this sense, ILDA can be used as an intelligent agent for independent incremental learning. In spite of that, ILDA still conducts a rigid learning because ILDA does not make any instance choices before actual learning; it just passively learns whatever instances are confronted/provided. Specifically, we propose curiosity-driven LDA active learning because: (1) if the learning is carried out in a limited physical environment, such as remote space where an intelligent agent is requested to discover a planetary geological pattern, effective information collection is extremely important, since only a limited number of instances are allowed to be transferred back to Earth for further analysis due to the availability constraints of the deep space communication system [11]; and (2) just as some attributes are more useful than others, so may some instances better aid the learning process than others. Given the noise conditions of the real world, instance selection is to increase the efficiency of learning by focusing attention on informative and, most importantly, discriminative instances. Motivated by curiosity learning, we model here a new curiosity-driven incremental LDA (cILDA) agent to enable the agent to actively perform incremental learning by learning only the instances with discriminative curiosity. For consistency, we call such discriminative instances curiosity instances in the rest of the paper. Based on the cILDA, we further developed a multi-agent cooperative learning system (mcILDA), where multiple cILDA agents are set to cooperate for discovering the discriminant character of a global pattern. We experimented with the proposed method on various datasets with different levels of discriminative redundancy, demonstrating the capability of the proposed method to learn fewer curiosity instances while having no decrease in the resulting eigenspace discriminability.

II. RELATED RESEARCH

A. Overview of Active Learning

In the literature, active learning has been researched in the following three aspects:
(1) Query-based approaches. Commonly used approaches in this category include pool-based active learning, membership queries, and query by committee [7]. Pool-based active learning uses an unlabelled sample pool by explicitly estimating sample density when selecting examples for labelling.



The approach is computationally intensive because most pool-based active learning methods iteratively select each example from the pool to determine whether it is informative or irrelevant. Also, the technique is not effective for reducing the error rate when the pool is too small. As a comparison, membership query [2] is an active learning algorithm that constructs examples directly from the dataset for the purpose of querying for labels; thus such a membership query scheme does not have the drawbacks posed by the pool-based scheme, reduces the predictive error rapidly, and is less computationally intensive. Another known query-based approach is Query by Committee (QBC). QBC selects data for labelling based on the disagreement amongst an ensemble of hypotheses. Commonly used QBC methods, i.e. ensembles with active learning, include techniques such as Bagging and Boosting.
(2) Active classifier approaches. An active classifier approach normally obtains the values of unlabelled data at some cost, which is calculated using the probably-approximately-correct (PAC) model, where the calculation weighs the cost required to obtain additional values against the penalty imposed by inaccurate classification. Alternatively, a transductive learning scheme [3] was used to directly reduce the assessed uncertainty of predictions on given unlabelled data, and thus effectively exploit the information of unlabelled data in active learning. Additionally, clustering and batch mode active learning [4] are other flavours of active learning which aim at decreasing the redundancy amongst the selected examples, therefore providing more unique examples for the refinement of classifiers. For example, the incorporation of active learning with support vector machines has been used in the fields of bioinformatics and text categorization. However, it is noticeable that the majority of these approaches use the pool-based technique, which suffers from the drawbacks stated previously. Thus, although the incorporation of active learning with SVM is good, other approaches such as membership querying or batch mode active learning are more practically useful, as they negate the drawbacks introduced by pool-based learning.
(3) Curiosity-based approaches. Methods such as maximum curiosity and the curiosity composite classifier function by assuming each unlabelled example to be informative and then non-informative, or to take each of the possible labels that the example may have, and then selecting the classifier which gives the highest accuracy on cross validation [8]. Other methods of this kind are additive curiosity and additive Bayesian surprise. Methods such as Minimum Marginal Hyperplane and Maximum Entropy choose the next unlabelled examples based on their closeness to the decision boundary. The problem introduced by these schemes is that they select atypical data points, ignoring the data distribution, which leads to poor performance. On the other hand, methods like Maximum Marginal Hyperplane, Minimum Entropy, and Entropy Tradeoff work exactly opposite to the above methods, selecting the unlabelled examples which are farthest from the decision boundary [6].

B. Overview of Curiosity Learning

Several models of artificial curiosity and intrinsic motivation have been proposed [12]-[17]. These models are based on an architecture which comprises a unit that learns to anticipate the consequences of the agent's actions. In these models, the agent's actions are actively chosen according to curiosity. Curiosity is defined as the error between the actual consequences of actions and the agent's prediction of them. This curiosity means "specific curiosity" and does not include "diversive curiosity". The existing models can be divided into three groups, according to the way action selection is made depending on the predictions [12].
Type 1: Error maximization. In the first type [13], [14], agents directly use the error between the consequences of actions and the agent's prediction. The action chosen at each step is the one for which the agent predicts the largest error in the prediction of the learning unit. This has been shown to be efficient when the learning unit has to learn a mapping which is learnable, deterministic and with homogeneous Gaussian noise [13], [14]. But this method shows limitations when used in a real uncontrolled environment.
Type 2: Learning progress maximization. A second type of model tries to avoid unlearnable situations by using the prediction error of the learning unit more indirectly [15], [16]. The key point of these models is that the interestingness of candidate situations is evaluated using the difference between the expected mean error rate of the predictions of the learning unit in the close future and the mean error rate in the close past. For each situation that the agent encounters, it is given an internal reward which is equal to the inverse of this difference (which also corresponds to the local derivative of the error rate curve of the learning unit). The internal reward is positive when the error rate decreases, and negative when it increases. The model chooses the action that will lead to the greatest decrease of the mean error rate of the learning unit. However, these models have only one learning unit, so the error between different situations serves as the internal reward. Indeed, using this direct measure of the decrease in the prediction error rate provides the agent with internal rewards when shifting from an activity with a high mean error rate to activities with a lower mean error rate, and these rewards can be higher than those corresponding to an effective increase of the skills of the agent in one of the activities. This pushes the agent towards unstable behaviour, in which it focuses on the sudden shifts between different kinds of activities rather than concentrating on the actual activities.
Type 3: Similarity-based progress maximization. A third type of model has several learning units in order to learn for each situation [17], [12]. Each learning unit learns from similar situations.


This local learning system is evaluated using the difference between the expected mean error rate of the predictions of the learning unit in the close future and the mean error rate in the close past, for each learning unit. Thus, autonomous learning (in which the complexity to be learned increases gradually) is possible for the agent. A related approach was proposed, presenting an implementation of the idea of evaluating the learning progress by monitoring the evolution of the error rate in similar situations [17]. The implementation described was tested in discrete environments such as a two-dimensional grid virtual world on which an agent could move and perform one of four discrete actions. The similarity of two situations was evaluated by a binary function stating whether or not they correspond exactly to the same discrete state. It was shown that in this case the system can significantly speed up the learning, even if some parts of the space are pure noise. However, because the system was only tested in a discrete simulated environment, it is difficult to generalize the results to the general case in which the environment and action spaces are continuous, and where two situations are never numerically exactly the same. In another model [12], the agent increases the number of learning units so that the inputs to each learning unit become similar. This system sets a suitable number of learning units according to the environmental complexity. The author shows the validity of this model by robotic simulation using the Webots software and by the playground experiment using the Sony AIBO robot. Experimental results show that the robot first spends time in situations which are easy to learn, then shifts its attention progressively to situations of increasing difficulty, avoiding situations in which nothing can be learnt. The author discusses this in relation to more complex forms of behavioural organization and data coming from developmental psychology.

III. CURIOSITY-DRIVEN ILDA AGENT

The classical LDA works in a batch way, assuming that the whole dataset is given in advance and is trained in one batch. Given training data X = \{x_i\}, (i = 1, ..., N), where n_c is the number of instances in class c such that N = \sum_{c=1}^{M} n_c, and \bar{x}_c is the class mean vector of class c, a classical LDA discriminant eigenspace can be constructed as \Omega = (S_w, S_b, \bar{x}, N), where the instance mean vector is \bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i, the within-class scatter matrix is

S_w = \sum_{c=1}^{M} \sum_{x \in X_c} (x - \bar{x}_c)(x - \bar{x}_c)^T,    (1)

the between-class scatter matrix is

S_b = \sum_{c=1}^{M} n_c (\bar{x}_c - \bar{x})(\bar{x}_c - \bar{x})^T,    (2)

and the linear transformation U is computed by an eigenvalue decomposition, DU = U\Lambda, where the matrix D is defined as S_w^{-1} S_b, so that the data transformed by U have the maximum S_b and the minimum S_w.

A. Incremental LDA

For agent construction, the above classical LDA is required to accommodate new instances confronted at different times in the future. Pang et al. [10] have developed a sequential incremental LDA (ILDA), briefed as follows, which has enabled LDA to be used as an agent with the capability of incrementally learning new data instance by instance. Given the (N+1)th training instance y presented with class label k, the updated discriminant eigenspace model \Omega' = (S_w', S_b', \bar{x}', N+1) for [X \; y] can be computed over \Omega and y. The updated mean is

\bar{x}' = (N\bar{x} + y)/(N + 1).    (3)

For the between-class scatter matrix S_b, if k = M + 1, then the updated between-class scatter matrix is

S_b' = \sum_{c=1}^{M+1} n_c' (\bar{x}_c - \bar{x}')(\bar{x}_c - \bar{x}')^T    (4)

where n_c' is the number of instances in class c after y is presented, n_c' = n_c when 1 \le c \le M, and n_c' = 1, \bar{x}_c = y when c = M + 1. Else, if 1 \le k \le M, the updated matrix S_b' is

S_b' = \sum_{c=1}^{M} n_c' (\bar{x}_c' - \bar{x}')(\bar{x}_c' - \bar{x}')^T    (5)

where \bar{x}_c' = \frac{1}{n_c + 1}(n_c \bar{x}_c + y) and n_c' = n_c + 1 if y belongs to class c; else \bar{x}_c' = \bar{x}_c and n_c' = n_c.

For the within-class scatter matrix S_w, if y is a new class instance, which means k is the (M+1)th class, then the updated within-class scatter matrix has no change,

S_w' = S_w,    (6)

else, if 1 \le k \le M, the updated S_w matrix is

S_w' = S_w + \frac{n_k}{n_k + 1}(y - \bar{x}_k)(y - \bar{x}_k)^T.    (7)

The above ILDA can be used to construct an agent capable of updating the current discriminant knowledge \Omega(t) by \Omega(t+1) = F(\Omega(t), y) whenever a new instance y is confronted by the agent in the future. However, the ILDA is counted as a passive agent in the sense of active learning, because the ILDA agent learns passively every instance confronted, even if the instance is confirmed redundant or noise data.
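To make the update rules concrete, the following is a minimal sketch in Python/NumPy of Eqs. (1)-(7). It is an illustrative reading, not an implementation from the paper: the class name DiscriminantEigenspace and its methods are assumed names, the between-class scatter is simply rebuilt from the updated class means via Eq. (2), and Eq. (7) is applied as the standard incremental scatter identity.

```python
import numpy as np

class DiscriminantEigenspace:
    """Sketch of the eigenspace model Omega = (Sw, Sb, class means/counts, overall mean, N)."""

    def __init__(self, dim):
        self.Sw = np.zeros((dim, dim))   # within-class scatter, Eq. (1)
        self.Sb = np.zeros((dim, dim))   # between-class scatter, Eq. (2)
        self.means = {}                  # class label -> class mean vector
        self.counts = {}                 # class label -> number of instances
        self.mean = np.zeros(dim)        # overall instance mean
        self.N = 0                       # total number of instances learned

    def update(self, y, k):
        """ILDA update with one instance y of class k, following Eqs. (3)-(7)."""
        y = np.asarray(y, dtype=float)
        self.mean = (self.N * self.mean + y) / (self.N + 1)        # Eq. (3)
        self.N += 1
        if k not in self.means:
            # new class: Sw is unchanged (Eq. (6)); the new class mean is y itself
            self.means[k], self.counts[k] = y.copy(), 1
        else:
            n_k, m_k = self.counts[k], self.means[k]
            # Eq. (7): Sw' = Sw + n_k/(n_k+1) (y - mean_k)(y - mean_k)^T
            self.Sw += (n_k / (n_k + 1.0)) * np.outer(y - m_k, y - m_k)
            self.means[k] = (n_k * m_k + y) / (n_k + 1)            # updated class mean, Eq. (5)
            self.counts[k] = n_k + 1
        # Eqs. (4)/(5): rebuild Sb from the updated class means and overall mean
        self.Sb = np.zeros_like(self.Sb)
        for c, m in self.means.items():
            d = m - self.mean
            self.Sb += self.counts[c] * np.outer(d, d)

    def transform_matrix(self, n_components):
        """Columns of U from the eigendecomposition of D = Sw^{-1} Sb (pseudo-inverse for safety)."""
        evals, evecs = np.linalg.eig(np.linalg.pinv(self.Sw) @ self.Sb)
        order = np.argsort(-evals.real)
        return np.real(evecs[:, order[:n_components]])
```

The eigenvalue step corresponds to the classical LDA transformation; in an incremental setting it would only be recomputed when the transform is actually needed.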

B. Curiosity-driven ILDA (cILDA) Agent

For active learning, we consider here a curiosity-driven ILDA (cILDA) to empower the ILDA agent with the ability of detecting the discriminative interestingness of data before it is delivered for ILDA learning. Recall that LDA targets discriminant eigenspace learning; the learning effectiveness lies in the discriminability difference between the LDA-transformed space and the original space. In this sense, the curiosity of LDA should be defined as a type of residue that reflects the discriminability difference between the LDA-transformed space and the original space.


Straightforwardly, when m agents exist, we measure the discriminability difference e of the nth agent at time t by a classification performance evaluation as

e_n(t) = A_{d_n}(t) - A_{o_n}(t)    (8)

where A_{d_n}(\cdot) is the classification accuracy in the discriminant eigenspace, and A_{o_n}(\cdot) is the accuracy in the original space. It could be any type of classification performance evaluation by any classifier. For fast curiosity detection, a leave-one-out (LOO) K-nearest-neighbour (KNN) classifier with k = 1 is used in our experiments. As the curiosity varies over time, we evaluate in practice a more reliable discriminability change as the smoothed derivative of the error curve of e over a set of recently confronted instances,

\langle e_n(t) \rangle = \frac{\sum_{i=0}^{\theta} e_n(t - i)}{\theta + 1}    (9)

\langle e_n(t - \tau) \rangle = \frac{\sum_{i=0}^{\theta} e_n(t - \tau - i)}{\theta + 1}    (10)

where \tau is a time window parameter and \theta a smoothing parameter. Then, the discriminability difference at time t can be calculated by

L_n(t) = \langle e_n(t) \rangle - \langle e_n(t - \tau) \rangle.    (11)

For an independent agent, curiosity arises when L obtains a positive value. Using Eq. (11), the above ILDA agent can be renovated into a curiosity-driven ILDA (cILDA) as

\Omega(t+1) = \begin{cases} F_c(\Omega(t), y) & \text{if } L(t) > \varepsilon \\ \Omega(t) & \text{otherwise,} \end{cases}    (12)

where only discriminative instances are allowed for ILDA learning. \varepsilon is the threshold for an independent cILDA agent; a smaller \varepsilon leads to a bigger number of curiosity instances and a better discriminability of the resulting LDA.
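The curiosity test of Eqs. (8)-(12) can be sketched as below, assuming the DiscriminantEigenspace class from the previous sketch; the helper names (loo_1nn_accuracy, curiosity_residue, curiosity_arises) and the default parameter values are illustrative assumptions, with a LOO 1-NN accuracy standing in for the generic classifier accuracy A(.).

```python
import numpy as np

def loo_1nn_accuracy(X, labels):
    """Leave-one-out 1-NN classification accuracy, used as A(.) in Eq. (8)."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    correct = 0
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                           # leave the query instance out
        correct += labels[int(d.argmin())] == labels[i]
    return correct / len(X)

def curiosity_residue(model, X, labels, n_components=2):
    """Eq. (8): accuracy in the discriminant eigenspace minus accuracy in the original space."""
    U = model.transform_matrix(n_components)
    return loo_1nn_accuracy(np.asarray(X, dtype=float) @ U, labels) - loo_1nn_accuracy(X, labels)

def smoothed(e_history, t, theta):
    """Eqs. (9)/(10): mean of the theta+1 most recent residues ending at index t."""
    window = e_history[max(0, t - theta): t + 1]
    return sum(window) / len(window)

def curiosity_arises(e_history, t, theta=3, tau=2, eps=3e-3):
    """Eqs. (11)/(12): learn an instance only when L(t) = <e(t)> - <e(t - tau)> exceeds eps."""
    if t - tau < 0:
        return True                             # not enough history yet: keep acquiring
    L = smoothed(e_history, t, theta) - smoothed(e_history, t - tau, theta)
    return L > eps
```

Under this sketch, a cILDA agent evaluates curiosity_arises for each newly confronted instance and calls model.update(y, k) only when it returns True, otherwise leaving Omega(t) unchanged as in Eq. (12).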

IV. MULTIPLE cILDAs COOPERATIVE LEARNING (mcILDA)

In a multi-agent environment, cILDA agents are required to share their LDA knowledge with each other at every stage of incremental learning, since the curiosity calculation should be updated based on the current global LDA knowledge (i.e. the summary of all agents' current LDA knowledge). Also, the knowledge acquired by each individual cILDA agent needs to be merged at the final stage so that the discriminant characterization of the pattern is output as

\Omega = \bigcup_{i=1}^{m} \Omega_i    (13)

where m, m \ge 1, is the number of cILDA agents. Thus, the LDA merging \cup is developed as follows: given a discriminant eigenspace \Omega = (S_w, S_b, \bar{x}, N) with the class label set \{c\}, we have N = \sum_{\{c\}} n_c, \bar{x} = \frac{\sum_{\{c\}} n_c \bar{x}_c}{N}, and S_w = \sum_{\{c\}} \Phi_c, where \Phi_c is the within-class scatter matrix of class c. Thus, the eigenspace \Omega can be transformed to another form, (\{\Phi_c\}|_{c \in \{c\}}, S_b, \{\bar{x}_c\}|_{c \in \{c\}}, \{n_c\}|_{c \in \{c\}}, \{c\}). Given two discriminant eigenspace models, \Omega_p on X and \Omega_q on Y, where
\Omega_p = (\{\Phi_{pc}\}|_{c \in \{c_p\}}, S_{bp}, \{\bar{x}_{pc}\}|_{c \in \{c_p\}}, \{n_{pc}\}|_{c \in \{c_p\}}, \{c_p\}), and
\Omega_q = (\{\Phi_{qc}\}|_{c \in \{c_q\}}, S_{bq}, \{\bar{x}_{qc}\}|_{c \in \{c_q\}}, \{n_{qc}\}|_{c \in \{c_q\}}, \{c_q\}),
then the problem of LDA merging is to compute the combined discriminant eigenspace model \Omega_r for Z = [X \cup Y] using only \Omega_p and \Omega_q. Specifically, it is equivalent to computing the merged \{\Phi_{rc}\}|_{c \in \{c_r\}}, S_{br}, \{\bar{x}_{rc}\}|_{c \in \{c_r\}}, \{n_{rc}\}|_{c \in \{c_r\}}, and \{c_r\}, respectively. Clearly, \{c_r\} is the combination of \{c_p\} and \{c_q\} but with the repeated classes removed. That is, \{c_r\} = \mathrm{unique}(\{c_p\} \cup \{c_q\}), where \mathrm{unique}(\cdot) is the set unique function. With \{c_r\}, the combined \{n_{rc}\}|_{c \in \{c_r\}} is calculated as follows: for \forall c_r \in \{c_r\},

n_{rc} = \begin{cases} n_{pc} & \text{if } c_r \in \{c_p\} \text{ and } c_r \notin \{c_q\} \\ n_{qc} & \text{if } c_r \notin \{c_p\} \text{ and } c_r \in \{c_q\} \\ n_{pc} + n_{qc} & \text{otherwise, } c_r \in \{c_p\} \text{ and } c_r \in \{c_q\}. \end{cases}    (14)

In a similar way, the combined class mean vector \{\bar{x}_{rc}\}|_{c \in \{c_r\}} can be obtained as follows: for \forall c_r \in \{c_r\},

\bar{x}_{rc} = \begin{cases} \bar{x}_{pc} & \text{if } c_r \in \{c_p\} \text{ and } c_r \notin \{c_q\} \\ \bar{x}_{qc} & \text{if } c_r \notin \{c_p\} \text{ and } c_r \in \{c_q\} \\ \frac{n_{pc}\bar{x}_{pc} + n_{qc}\bar{x}_{qc}}{n_{pc} + n_{qc}} & \text{otherwise.} \end{cases}    (15)

Naturally, the instance mean vector \bar{x}_r can be updated as

\bar{x}_r = \frac{\sum_{\{c_r\}} n_{rc} \bar{x}_{rc}}{\sum_{\{c_r\}} n_{rc}}.    (16)

Substituting the above merged class mean vectors \bar{x}_{rc} and the merged instance mean vector \bar{x}_r into the original between-class matrix function, Eq. (2), we have the combined S_{br} as S_{br} = \sum_{\{c_r\}} n_{rc}(\bar{x}_{rc} - \bar{x}_r)(\bar{x}_{rc} - \bar{x}_r)^T. For the combined within-class scatter matrices \{\Phi_{rc}\}|_{c \in \{c_r\}}, we compute \Phi_{rc} for each c_r \in \{c_r\} as follows: if c_r \in \{c_p\} and c_r \notin \{c_q\}, then \Phi_{rc} = \Phi_{pc}; else if c_r \in \{c_q\} and c_r \notin \{c_p\}, then \Phi_{rc} = \Phi_{qc}; otherwise c_r \in \{c_p\} and c_r \in \{c_q\}, in other words c_r is a common class of \Omega_p and \Omega_q, and then it is not difficult to prove that

\Phi_{rc} = \Phi_{pc} + \Phi_{qc} + \frac{n_{pc} n_{qc}}{n_{pc} + n_{qc}} (\bar{x}_{pc} - \bar{x}_{qc})(\bar{x}_{pc} - \bar{x}_{qc})^T.    (17)
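A sketch of the pairwise merging in Eqs. (14)-(17) is given below, representing each agent's eigenspace by its per-class statistics; the function name merge_eigenspaces and the dictionary layout are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def merge_eigenspaces(p, q):
    """Merge two eigenspace models Omega_p and Omega_q into Omega_r (Eqs. (14)-(17)).

    p and q are dicts with keys 'counts' {c: n_c}, 'means' {c: mean_c},
    and 'Phi' {c: within-class scatter matrix of class c}.
    """
    classes = set(p['counts']) | set(q['counts'])        # {c_r} = unique({c_p} U {c_q})
    counts, means, Phi = {}, {}, {}
    for c in classes:
        in_p, in_q = c in p['counts'], c in q['counts']
        if in_p and not in_q:                            # class seen only by agent p
            counts[c], means[c], Phi[c] = p['counts'][c], p['means'][c], p['Phi'][c]
        elif in_q and not in_p:                          # class seen only by agent q
            counts[c], means[c], Phi[c] = q['counts'][c], q['means'][c], q['Phi'][c]
        else:                                            # common class of both agents
            n_p, n_q = p['counts'][c], q['counts'][c]
            counts[c] = n_p + n_q                                                   # Eq. (14)
            means[c] = (n_p * p['means'][c] + n_q * q['means'][c]) / (n_p + n_q)    # Eq. (15)
            d = p['means'][c] - q['means'][c]
            Phi[c] = p['Phi'][c] + q['Phi'][c] + (n_p * n_q) / (n_p + n_q) * np.outer(d, d)  # Eq. (17)
    N = sum(counts.values())
    mean = sum(counts[c] * means[c] for c in classes) / N                            # Eq. (16)
    Sb = sum(counts[c] * np.outer(means[c] - mean, means[c] - mean) for c in classes)  # Eq. (2)
    Sw = sum(Phi.values())                               # merged within-class scatter
    return {'counts': counts, 'means': means, 'Phi': Phi,
            'mean': mean, 'N': N, 'Sb': Sb, 'Sw': Sw}
```

Merging all m agents then amounts to folding this pairwise rule over the list of agent eigenspaces, which is one way to realize the iterative merging behind Eq. (13).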

A. The Proposed cILDA/mcILDA Algorithm

Fig. 1 shows the flowchart of the proposed mcILDA learning. At the beginning, the instance space is divided and allocated to a set of agents. Next, the discriminant eigenspace model and the curiosity are initialized by diversive curiosity learning. Then, instances with a high effect on learning are acquired for curiosity learning. Processing is finished when all the agents lose curiosity. Note that the proposed mcILDA is a cILDA when the number of cILDA agents in mcILDA is 1.

[Algorithm: Multiple cILDA agents Cooperative Learning]


Input: m, the number of used agents; X, the instance set; \varepsilon, the curiosity threshold value.
Output: \Omega, the discriminant eigenspace obtained by curiosity-driven multi-agent cooperative learning.
Step 1: Allocate m cILDA agents randomly to the instance space.
Step 2: For each agent F_{c_i} (i = 1, ..., m), perform the following curiosity and LDA initialization learning steps:
  (a) acquire data independently;
  (b) initialize the agent discriminant eigenspace \Omega_i(0) once instances of more than 2 classes are acquired;
  (c) initialize the curiosity L_i(0) by acquiring instances until the curiosity by Eq. (11) arises.
Step 3: For each cILDA agent F_{c_i}, perform the following specific curiosity learning steps:
  (d) proceed with the learning by acquiring new instances independently;
  (e) update the discriminant eigenspace by cILDA, \Omega_i(t) \rightarrow \Omega_i(t+1);
  (f) update the current curiosity by Eq. (11), L_i(t) \rightarrow L_i(t+1);
  (g) share LDA knowledge among the m cILDA agents by Eq. (13).
Step 4: Go to Step 3 for the next round of active learning.
Step 5: The above loop ends when no curiosity (\forall i, L_i(t) < \varepsilon) is found by any of the m agents.
Step 6: Merge the discriminant eigenspaces from all agents by Eq. (13), using the iterative LDA merging described in Section IV.
Step 7: Output the curiosity instances and the obtained LDA eigenspace.
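Putting the pieces together, a single-agent version of the loop (Steps 2-5 for one cILDA agent) might look like the sketch below; it reuses the hypothetical DiscriminantEigenspace, curiosity_residue and curiosity_arises helpers from the earlier sketches, and in the multi-agent case the knowledge-sharing step (g) and the final merging of Step 6 would be added around it.

```python
def c_ilda(stream, dim, eps=3e-3, theta=3, tau=2, n_components=2):
    """One cILDA agent over a stream of (x, label) pairs; returns the model and curiosity instances."""
    model = DiscriminantEigenspace(dim)
    seen_X, seen_y, e_hist, curiosity_instances = [], [], [], []
    for t, (x, k) in enumerate(stream):
        seen_X.append(x)
        seen_y.append(k)
        if len(set(seen_y)) < 2:
            # Step 2(b)/(c): learn unconditionally until at least two classes are present
            model.update(x, k)
            e_hist.append(0.0)
            curiosity_instances.append((x, k))
            continue
        # Step 3(f): update the curiosity residue e(t) of Eq. (8)
        e_hist.append(curiosity_residue(model, seen_X, seen_y, n_components))
        if curiosity_arises(e_hist, t, theta, tau, eps):
            # Step 3(e) / Eq. (12): only curiosity instances are learned
            model.update(x, k)
            curiosity_instances.append((x, k))
        # otherwise Omega(t+1) = Omega(t); Step 5 stops once curiosity stays below eps
    return model, curiosity_instances
```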

V. EXPERIMENTS AND DISCUSSIONS

In this section, we examine the efficiency and accuracy of the proposed method, and compare it to LDA/ILDA. In particular, we investigate the relationships between 1) the discriminability and the number of instances, 2) the redundancy and the number of instances, and 3) the discriminability and the number of agents. To experiment on data with different discriminative characterizations, we used datasets from two database resources. One resource is the UCI Machine Learning Repository [18], from which we selected 8 datasets that have different application backgrounds, features that are 100% continuous/integer values, and no missing values. The other resource is the MPEG-7 face database [19], which consists of two subsets (pose and light), with a total of 1355 face images of 271 persons, 5 different face images per person, and each face image of size 56 x 46.

Fig. 1. Block diagram of the proposed curiosity-driven multi-agent cooperative LDA learning.

A. Experimental Setup

For each dataset, we implemented the proposed cILDA and mcILDA learning methods. As mcILDA is equivalent to cILDA when the number of cILDA agents in mcILDA is 1, we experimented with cILDA and mcILDA as one model, whose number of cILDA agents is determined by cross-validation experiments. As the learning goes on, we collected every curiosity instance selected by each agent at every learning stage. For performance evaluation, we compared the eigenspace over the selected curiosity instances with the eigenspace over all instances (i.e. the entire dataset) by performing a leave-one-out K-NN classification on all instances. Concerning feature selection, we rank the eigenvectors by their energy and select a set of top-energy eigenvectors. The number of eigenvectors used in all models under comparison is the same. In the experiments, the parameters \theta and \tau are not sensitive to the curiosity learning performance, only to the speed of the curiosity calculation. So, for quick curiosity detection, we set \theta as 3 and \tau as 2. \varepsilon is relevant to the number of curiosity instances and the discriminability of the resulting LDA. For each experiment, we fixed \varepsilon by the rule that the curiosity instances are significantly selected with, at most, minor sacrifices in discriminability. For the convenience of description, we use the term learning stage instead of the usual time scale, since the events of data arrival in the above incremental learning may not happen at a regular time interval. Here, the number of learning stages is equivalent to the number of instances that have been learned by the incremental models.
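As a concrete reading of this evaluation protocol, the sketch below (again reusing the hypothetical loo_1nn_accuracy helper and transform_matrix method from the earlier sketches) projects all instances into a model's eigenspace and reports the LOO 1-NN accuracy; the difference between the curiosity-trained and batch-trained models corresponds to the 'Diff.' comparison reported in the experiments.

```python
import numpy as np

def eigenspace_accuracy(model, X_all, y_all, n_components):
    """LOO 1-NN accuracy of the entire dataset projected into the model's discriminant eigenspace."""
    U = model.transform_matrix(n_components)   # same number of top-energy eigenvectors for all models
    return loo_1nn_accuracy(np.asarray(X_all, dtype=float) @ U, y_all)

# diff = eigenspace_accuracy(curiosity_model, X, y, d) - eigenspace_accuracy(batch_model, X, y, d)
```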


Fig. 2. The comparison of data distribution between the synthetic dataset and the selected curiosity instances by the proposed cILDA/mcILDA learning method: (a) the data distribution of the entire dataset; and (b) the data distribution of the selected curiosity samples (axes x1, x2).

Fig. 3. The comparison of mcILDA and ILDA on the performance of incremental learning (classification error vs. learning stages; difference: 0.002571).


B. Synthetic Dataset

We first experimented with the proposed cILDA/mcILDA on a synthetic dataset that has 3 classes and 213 instances. The data distribution is a mixture of several 2D ([x1, x2]) Gaussian distributions, as shown in Fig. 2. We investigated the data distribution using the proposed mcILDA, where the number of cILDA agents is 5. Fig. 2(b) gives the distribution of the 40 curiosity instances obtained by the proposed curiosity learning method. Compared to the data distribution of the entire 213 instances, the discriminative representativeness of the selected curiosity instances is clear, because the obtained curiosity instances include all critical instances for class distinction, such as instances involving class mixture, and the major representative instances of each independent class. Fig. 3 illustrates the whole procedure of incremental learning with a comparison to ILDA, where the horizontal and vertical axes represent the incremental stage and the classification error from k-NN (k=1). As seen from the figure, the proposed mcILDA and ILDA are compared on the classification error at every incremental learning step.

The absolute difference in the classification error between the two methods is 0.002571, which indicates that the proposed mcILDA achieves 99.74% of the learning effectiveness of the original ILDA, although mcILDA learns only 18.78% of the total 213 instances learned by ILDA.

C. UCI Datasets

The second experiment targeted testing the effectiveness of the curiosity modelling on UCI datasets. For each UCI dataset, the proposed curiosity ILDA learning returns a set of curiosity instances and the obtained discriminant eigenspace. Table I gives a comparison of LDA over curiosity instances versus LDA over all instances on the classification of the entire dataset. In the table, \varepsilon is fixed for each dataset, the number of curiosity instances and the percentage of the number of all instances is denoted as 'No. Instances (rate)', and the classification accuracy is denoted as 'Acc.'. The discriminability difference (denoted as 'Diff.') is calculated as the proposed method minus batch LDA in terms of the K-NN LOO classification over all instances of the dataset. As seen in the table, the proposed curiosity LDA method ignores 31%-79% of the instances of the whole dataset and constructs discriminant eigenspaces only on the remaining 21%-69% curiosity instances. However, the discriminability of the eigenspace obtained from the curiosity instances, compared to the eigenspace from all instances (using batch LDA), shows no decrease; on the contrary, in most cases there is a slight increase. This suggests that the proposed curiosity learning is valid and, importantly, that the obtained curiosity instances have the expected discriminative representativeness.

D. Performance under Different Levels of Discriminative Redundancy

To test the performance of the proposed method under different levels of discriminative redundancy, we studied the Face Membership Authentication (FMA) problem [20], [21], [22]. The FMA is to distinguish the membership class (cls. 1) from the non-membership class (cls. 2) in a total group through a binary class classification.


TABLE I
COMPARISON OF LDA OVER CURIOSITY INSTANCES VERSUS LDA OVER ALL INSTANCES OVER 8 UCI DATASETS.

                                      cILDA/mcILDA                            Batch LDA
Datasets        ε       No. Agents    No. Instances (rate[%])   Acc.[%]      No. Instances   Acc.[%]      Diff.[%]
Iris            3.0e-3  5             41 (27.3)                 98.0         150             98.0         ±0.0
Liver-disorder  3.0e-3  8             203 (58.8)                65.7         345             62.6         +3.1
Vehicle         3.0e-3  16            251 (29.7)                77.6         846             75.4         +2.2
Glass           3.0e-3  1             98 (45.8)                 71.5         214             67.7         +3.8
Wine            5.0e-3  11            115 (64.6)                98.2         178             96.6         +1.6
Sonar           5.0e-3  21            143 (68.8)                84.3         208             81.2         +3.1
Balance         3.0e-3  7             149 (23.8)                96.4         625             93.9         +2.5
Heart           3.0e-3  4             63 (21.2)                 55.6         297             55.6         ±0.0

We considered the FMA as a category of problem involving different levels of discriminative redundancy, because the size of the membership group is often smaller than that of the non-membership group, which implies that not every instance is discriminatively important for FMA. In addition, such discriminative redundancy is manually adjustable, because the size of the membership group can be dynamically changed: the smaller the size of the membership group, the higher the discriminative redundancy involved in the FMA case study. We conducted FMA experiments using the whole image database, which contains in total 271 persons and 1355 face images. We set the membership size ranging from 20 (cls. 1/cls. 2: 20/251) to 135 (cls. 1/cls. 2: 135/136) with an interval of 10 persons to obtain datasets with different levels of discriminative redundancy, and compared the proposed mcILDA (the number of agents is 3) with the classic batch LDA on FMA under the condition of different levels of discriminative redundancy, where \varepsilon is also fixed as 0.003. Fig. 4(a) shows the comparison of LDA discriminability between the proposed cILDA/mcILDA method and the batch LDA method under different levels of discriminative redundancy (equivalent to different sizes of the membership group), and Fig. 4(b) reports the corresponding number of curiosity instances selected by the proposed curiosity method in comparison to the original numbers of member and non-member samples. As seen in Fig. 4(b), the proposed curiosity LDA method learns LDA on 1000 to 1100 curiosity instances under different sizes of membership. It follows that 28.8%-26.2% of the total 1355 instances are reduced, and the performance of the proposed curiosity LDA method given in Fig. 4(a) outperforms, in most cases, the performance of the batch LDA on all 1355 instances. Interestingly, the total number of curiosity instances increases as the size of membership increases, due to the decrease of the discriminative redundancy. Moreover, the number of curiosity instances in the membership class is increasing, while the number of curiosity instances in the non-membership class is decreasing. When the size of membership is 135, the membership and non-membership instances become similar in number (cls. 1/cls. 2: 135 x 5/136 x 5), and the curiosity instances selected by the proposed method give a similar membership and non-membership percentage too. This indicates that the proposed curiosity LDA is able to correctly respond to different levels of discriminative redundancy with a different number of curiosity instances in each class.

e

Batch LDA

No. Agents

Datasets

VI. CONCLUSIONS AND FUTURE WORK

Unlike previous LDA learning methods that passively learn whatever instances are presented, the proposed method provides an active learning in which a biological curiosity concept is accommodated in an ILDA agent, and a group of such curiosity-driven agents jointly discover the discriminant characterization of a whole pattern through cooperative incremental learning. Over the datasets from different resources, the proposed curiosity LDA learning method is evaluated on: (1) with versus without curiosity, and (2) performance under different levels of redundancy. The experimental results demonstrate that the proposed curiosity-driven LDA learning helps learn LDA more efficiently on fewer instances, but with no performance reduction.

One limitation of the proposed method is that we used a simple K-NN method for both the curiosity evaluation and the performance evaluation and comparison. Thus, the representativeness of the selected curiosity instances is beneficial only to K-NN classification and the category of prototype-based methods such as Bayes classifiers [23], [24] and radial basis function neural networks [25]. However, it may not be as significant to classification using other methods such as hyperplane-based Support Vector Machines (SVM) and decision-tree-based C4.5. A straightforward solution is that one can replace the K-NN in Eq. (8) with one's own classifier; then the whole system shifts the significance to the replaced classifier. Note that the used classifier is required to have a high computational efficiency, as the curiosity calculation requires a quick response for online learning. The proposed mcILDA method, due to the cooperative learning setup, may generate more than one curiosity instance at one time, which can leave redundancy in those curiosity instances. Although our previous work has presented a successful practice on this point by introducing a competition schema for multi-agent ILDA cooperative learning [26], both works are sequential learning methods, whereas real applications require curiosity to be detected out from a chunk


[Fig. 4: (a) accuracy of Batch LDA vs. mcILDA against the size of the membership group; (b) number of selected samples against the size of the membership group.]