2017 13th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD 2017)

Fuzzy-Rough Feature Selection Based on λ-Partition Differentiation Entropy

Qian Sun, Yanpeng Qu, Ansheng Deng

Longzhi Yang

Information Technology College, Dalian Maritime University, Dalian, 116026, P. R. China. Email: [email protected]

Department of Computer Science and Digital Technologies, Northumbria University, Newcastle upon Tyne, NE1 8ST, UK. E-mail: [email protected]

Abstract—Fuzzy-rough set theory has proven to be an effective tool for feature selection. Whilst promising, many state-of-the-art fuzzy-rough feature selection algorithms are time-consuming when dealing with datasets which have a large number of features. In order to address this issue, a λ-partition differentiation entropy fuzzy-rough feature selection (LDE-FRFS) method is proposed in this paper. The λ-partition differentiation entropy extends the concept of partition differentiation entropy from rough sets to fuzzy-rough sets, from the view of a partition of the information system. In this case, it can efficiently gauge the significance of features. Experimental results demonstrate that, with such λ-partition differentiation entropy-based attribute significance, LDE-FRFS outperforms its competitors in terms of both the size of the reduced datasets and the execution time.
Index Terms—Feature selection, Fuzzy-rough sets, λ-Partition differentiation entropy.

I. INTRODUCTION

Feature selection [1–8] is a crucial task in data processing. Its fundamental purpose is to reduce the original attribute set by eliminating the redundant attributes and retaining the most representative ones. It thereby avoids large storage requirements and achieves an acceptable running time, often with improved results. Owing to these properties, feature selection is widely used in many areas, such as the medical and biomedical fields [4–6] and data mining [7, 8]. As these procedures are computationally intensive and time-consuming, many efforts have been made to improve their performance by finding an approximate reduct; some of these are based on rough set theory, some on fuzzy-rough set theory, and the rest on other methodologies.

Rough set theory [9] is a means of dealing with uncertain information which provides a popular mathematical framework for identifying crucial features. Many practical feature selection algorithms have been constructed upon rough set theory, such as dependency function-based, positive region-based, entropy-based and discernibility function-based ones. In [10], an expectation-maximisation clustering algorithm is used to decide the tolerance classes, with dependency as the evaluation function. In [11], a positive region-preserving method is used for a forward search, which can obtain a good reduct within an acceptable time. The aim of [12] is to provide a feature selection method for large-scale datasets based on an information-theoretical measurement in terms of partitions.


In [13], a variant of relative decision entropy is used, combining roughness with dependency to define a new information entropy model. In [14], an approximate discernibility matrix is applied to produce a smaller-sized reduct, and a more precise model is derived. The former two methods are essentially equivalent [15]. These algorithms can find a reduct in a known decision system within a relatively small time. However, as the size of a dataset becomes larger, it is not easy to gauge the uncertainty and obtain the significant attributes in a short time. To overcome this drawback, the partition differentiation entropy (PDE) in rough sets was introduced in [12] to evaluate features. For a given dataset, the original decision information system is first partitioned into several small systems according to the class labels. Then the PDE of each feature subset is computed on the resultant granular information systems. In this case, by computing over the small sub-systems instead of the raw large-scale one, the computational complexity is largely decreased. However, PDE cannot deal with numerical datasets directly.

In the framework of fuzzy-rough set theory, feature selection is an effective approach to removing irrelevant attributes. Much attention has been attracted to improving the performance of fuzzy-rough feature selection [1, 2, 15–18]. In [1], the definition of the fuzzy lower approximation in the dependency function may lead to the fuzzy lower approximation of the original attribute set being smaller than that of a subset. To deal with this problem, [16] redefines the fuzzy lower approximation used in the dependency function to study the relevant feature selection issues. The granular structure of the fuzzy lower approximation is adopted in [18], which generalises the traditional positive region-based reduct to the fuzzy positive region-based reduct. Optimisations of the fuzzy lower approximations with respect to the dependency function or the fuzzy positive region are introduced in [3, 17]. Given the significance of entropy theory in estimating uncertainty and the robustness of fuzzy-rough set theory in selecting features, many efforts have been devoted to the study of fuzzy-rough feature selection by information entropy. In [19], an information entropy in fuzzy-rough sets is investigated to describe the reduct with the dependency function, and the entropy is used to implement a feature selection algorithm.



In [20], it is found that the equivalence between such entropy-based reducts and dependency function-based reducts does not hold for all fuzzy information systems. Such a conditional entropy is not monotonic in general fuzzy information systems, and this may result in improper reducts. To solve this problem, a λ-conditional entropy is investigated in [15] and the corresponding feature selection algorithm is provided. However, with the growing scale of datasets, it remains difficult to obtain the importance of features in a short time.

In this paper, the concept of PDE is extended from rough sets to fuzzy-rough sets by a λ-partition differentiation entropy (λ-PDE), where λ is a self-adaptive threshold used to construct the granular framework on the decision information. The fuzzy-rough feature selection based on λ-partition differentiation entropy (LDE-FRFS) can deal with numerical datasets which can be represented as fuzzy decision systems. Moreover, compared to analogous algorithms, LDE-FRFS has a shorter running time for datasets with a large number of decision classes. Experimental results demonstrate that, with the λ-partition differentiation entropy-based attribute significance, LDE-FRFS outperforms its competitors in terms of both the size of the reduced datasets and the execution time.

The rest of this paper is organised as follows. Section II introduces some preliminary concepts of fuzzy-rough sets, fuzzy decision systems, attribute selection under fuzzy information systems and some definitions concerning PDE. In Section III, the definition of fuzzy granularity and the properties of λ-PDE are presented. In Section IV, experiments are conducted to assess the performance of the related algorithms. Section V concludes this paper and discusses future work.

II. BACKGROUND

This section reviews some essential knowledge about fuzzy-rough sets, the definition of a fuzzy information system and feature selection therein, and the introduction of PDE in rough sets.

A. Fuzzy information system and feature selection in it

A fuzzy decision system is a tuple $(U, A, D)$. $U$ is a nonempty finite set of objects, with $A \cap D = \emptyset$. $A$ is the condition attribute set and $D$ is the decision attribute set, for which a mapping $d: U \to V_D$ is defined, where $V_D$ is the set of values which the decision attribute takes. $F(U)$ is the fuzzy power set on $U$. For a fuzzy set $X \in F(U)$, the cardinality of $X$ is $|X| = \sum_{i=1}^{n} X(x_i)$. In [21], the definitions of the fuzzy lower and upper approximations are presented to approximate a fuzzy concept $X$: for $\forall x \in U$,

$$\underline{R_A}X(x) = \inf_{y \in U} I\big(R_A(x,y),\, X(y)\big), \tag{1}$$

$$\overline{R_A}X(x) = \sup_{y \in U} T\big(R_A(x,y),\, X(y)\big). \tag{2}$$

Here, $I$ represents the fuzzy implicator and $T$ represents the t-norm. $R_A$ is a fuzzy similarity relation induced by the feature subset $A$:

$$R_A(x,y) = \bigwedge_{a \in A} R_a(x,y). \tag{3}$$

From the lower and upper approximations, the degree to which $x$ undoubtedly belongs to $X$ and the degree to which $x$ possibly belongs to $X$ are obtained, where the fuzzy-rough set of $X$ is defined by $(\underline{R}X, \overline{R}X)$.
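To make Eqs. (1)–(3) concrete, the following is a minimal sketch in Python (not from the paper; all names are illustrative). It assumes the common choices of the Łukasiewicz implicator $I(a,b) = \min(1, 1-a+b)$, the minimum t-norm, and the per-attribute similarity $R_a(x,y) = 1 - |a(x) - a(y)|$ on $[0,1]$-normalised data:

```python
import numpy as np

def similarity_matrix(data, attrs):
    """Fuzzy similarity relation R_A (Eq. 3): per-attribute similarity
    R_a(x, y) = 1 - |a(x) - a(y)|, aggregated by the minimum t-norm
    over the attribute subset `attrs`."""
    n = data.shape[0]
    R = np.ones((n, n))
    for a in attrs:
        col = data[:, a]
        R = np.minimum(R, 1.0 - np.abs(col[:, None] - col[None, :]))
    return R

def lower_approximation(R, X):
    """Eq. (1) with the Lukasiewicz implicator I(a, b) = min(1, 1 - a + b):
    for each x, the infimum over y of I(R(x, y), X(y))."""
    return np.min(np.minimum(1.0, 1.0 - R + X[None, :]), axis=1)

def upper_approximation(R, X):
    """Eq. (2) with the minimum t-norm T(a, b) = min(a, b)."""
    return np.max(np.minimum(R, X[None, :]), axis=1)
```

Other implicator/t-norm pairs fit the same template; only the two inner expressions change.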

B. Partition differentiation entropy in rough sets

In [12], a new information entropy, entitled partition differentiation entropy (PDE), is introduced to capture the differences among attributes. It is defined as follows. Let $(U, A, D)$ be an information system in rough sets, $B \subseteq A$, $U/B = \{[x_i]_B \mid x_i \in U\}$ and $U/A = \{[x_i]_A \mid x_i \in U\}$. The value of

$$E(B \mid U \oplus A) = -\frac{1}{|U|}\sum_{x_i \in U} \log_2 \frac{|[x_i]_B \cap [x_i]_A|}{|[x_i]_B|} \tag{4}$$

is the PDE of $B$ in the information system $(U, A, D)$, where $|[x_i]_B \cap [x_i]_A| / |[x_i]_B|$ is defined as the degree to which the fundamental granule $[x_i]_A$ is contained in $[x_i]_B$. From $B \subseteq A$, for $\forall x_i \in U$, $[x_i]_A \cap [x_i]_B = [x_i]_A$ can be obtained. Thus, Eq. (4) can be simplified to

$$E(B \mid U \oplus A) = -\frac{1}{|U|}\sum_{x_i \in U} \log_2 \frac{|[x_i]_A|}{|[x_i]_B|}. \tag{5}$$

For a decision information system $(U, A, D)$ with $B \subseteq A$ and $U/D = \{D_1, D_2, \ldots, D_r\}$, it holds that

$$E(B \mid U \oplus A \oplus D) = \sum_{k=1}^{r} E(B \mid D_k \oplus A). \tag{6}$$

This formula is the PDE of the decision information system $(U, A, D)$, where $E(B \mid D_k \oplus A)$ is the PDE of $B$ in the sub information system $(D_k, A)$. PDE reflects the difference in significance between a subset of the condition attributes and the whole condition attribute set. The greater the value taken by a condition attribute subset on the partition, the more the partition induced by $B$ differs from that induced by $A$. In particular, if the value of $E(B \mid U \oplus A \oplus D)$ is 0 or small enough, $B$ and $A$ have the same or approximately the same capacity to describe the objects, which means $B$ is a reduct. On this basis, PDE can be applied to select features. For a dataset, the raw labelled information system is partitioned into several sub-systems, from which the PDE of each attribute subset is computed; the PDE of the subsets in the raw system is thus obtained. Besides, by using the sub-systems instead of the raw one, the computation time is significantly smaller. However, PDE can only handle categorical data rather than numerical data. Consequently, extending PDE from rough sets to fuzzy-rough sets can solve the numerical-data feature selection problem.
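As an illustration of Eqs. (5) and (6), the following is a minimal sketch (not from the paper; names are illustrative) of PDE for categorical data, where equivalence classes are computed from attribute-value tuples. The unweighted sum over the $D_k$ follows the reconstruction of Eq. (6) above:

```python
import numpy as np

def equivalence_classes(data, attrs):
    """Group object indices by their value tuple on `attrs` (U/B)."""
    classes = {}
    for i, row in enumerate(data):
        classes.setdefault(tuple(row[a] for a in attrs), set()).add(i)
    return classes

def pde(data, B, A):
    """Partition differentiation entropy E(B | U + A) of Eq. (5).
    Since B is a subset of A, [x]_A is always contained in [x]_B."""
    UB = equivalence_classes(data, B)
    UA = equivalence_classes(data, A)
    total = 0.0
    for row in data:
        size_B = len(UB[tuple(row[a] for a in B)])
        size_A = len(UA[tuple(row[a] for a in A)])
        total += np.log2(size_A / size_B)
    return -total / len(data)

def decision_pde(data, labels, B, A):
    """Eq. (6): sum of PDEs over the label-induced sub-systems D_k."""
    return sum(pde(data[labels == k], B, A) for k in np.unique(labels))
```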


III. λ-PARTITION DIFFERENTIATION ENTROPY AND FEATURE SELECTION


A. Fuzzy granularity

Given a real number $\lambda \in (0, 1]$, a fuzzy granule can be defined as Eq. (7), where $A$ is the whole condition attribute set and $B$ is a subset of the condition attribute set:

$$[x_\lambda]_B(y) = \begin{cases} 0, & 1 - R_B(x, y) \ge \lambda, \\ \lambda, & 1 - R_B(x, y) < \lambda. \end{cases} \tag{7}$$



For any $X \in F(U)$, $\underline{R_B}X(x) = \sup\{\lambda : [x_\lambda]_B \subseteq X\}$ [15]. This explains the relationship between the granularity and the fuzzy lower approximation; the proof of this theorem is given in [15]. Let $\lambda = \underline{R_A}X(x_i)$: if $\lambda \le \underline{R_B}X(x)$, then $[x_\lambda]_B \subseteq X$; similarly, if $\lambda > \underline{R_B}X(x)$, then $[x_\lambda]_B \not\subseteq X$. This indicates that $[x_\lambda]_B$ with $\lambda \le \underline{R_B}X(x)$ is a fundamental fuzzy granule with respect to $B$ that characterises the inner structure of $X$. Therefore, it uncovers the granular structure of the fuzzy lower approximation of $X$.
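The following is a minimal sketch (not from the paper; names are illustrative) of the fuzzy granule of Eq. (7), together with a check of the inclusion $[x_\lambda]_B \subseteq X$ that underlies the sup-characterisation above. By the theorem, the largest $\lambda$ for which the inclusion holds equals $\underline{R_B}X(x)$:

```python
import numpy as np

def fuzzy_granule(R_B, x, lam):
    """Eq. (7): membership of each y in the granule [x_lambda]_B,
    given the similarity matrix R_B and an object index x."""
    return np.where(1.0 - R_B[x, :] >= lam, 0.0, lam)

def granule_included(R_B, x, lam, X):
    """Test [x_lambda]_B <= X pointwise (fuzzy set inclusion)."""
    return bool(np.all(fuzzy_granule(R_B, x, lam) <= X))
```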

B. λ-partition differentiation entropy and its property

Given a fuzzy decision system $(U, A, D)$, a combination of PDE and fuzzy granularity, entitled λ-partition differentiation entropy (λ-PDE), is proposed in this paper. This information entropy is used to find a subset $B \subseteq A$ which has a similar identification ability for the decision attribute set $D$. The λ-PDE of $B$ in the information system is defined as

$$E_\lambda(B \mid U \oplus A) = -\frac{1}{|U|}\sum_{x_i \in U} \log_2 \frac{|[x^i_{\lambda_i}]_A \cap [x^i_{\lambda_i}]_B|}{|[x^i_{\lambda_i}]_B|}, \tag{8}$$

where

$$[x^i_{\lambda_i}]_B(x_j) = \begin{cases} 0, & 1 - R_B(x_i, x_j) \ge \lambda_i, \\ \lambda_i, & 1 - R_B(x_i, x_j) < \lambda_i, \end{cases} \tag{9}$$

is the fuzzy granule of $x_i$ with respect to $B$ and $\lambda_i = \underline{R_A}[x_i]_D(x_i)$. Eq. (8) can be simplified to

$$E_\lambda(B \mid U \oplus A) = -\frac{1}{|U|}\sum_{x_i \in U} \log_2 \frac{|[x^i_{\lambda_i}]_A|}{|[x^i_{\lambda_i}]_B|}. \tag{10}$$

Given a fuzzy decision system $(U, A, D)$, $U$ is divided into $r$ disjoint subsets $\{D_1, D_2, \ldots, D_r\}$, where each $D_i$ is a decision class $[x]_D$, the class to which $x$ belongs. In addition, the membership function of the decision class $[x]_D$ is

$$[x]_D(y) = \begin{cases} 1, & y \in [x]_D, \\ 0, & \text{otherwise}. \end{cases} \tag{11}$$

This paper defines

$$E_\lambda(B \mid U \oplus A \oplus D) = \sum_{K=1}^{r} \frac{|D_K|}{|U|}\, E_\lambda(B \mid D_K \oplus A) \tag{12}$$

as the λ-PDE of $B$ in the fuzzy decision information system $(U, A, D)$, where $E_\lambda(B \mid D_K \oplus A)$ is the λ-PDE of $B$ in the sub information system $(D_K, A)$. Eq. (12) denotes the average λ-PDE between $B$ and $A$ regarding the description of the system.

Property (Monotonicity): In a fuzzy decision system $(U, A, D)$ with $U = \{x_1, x_2, \ldots, x_n\}$, assuming that $B$ and $C$ are two subsets of the condition features with $C \subseteq B \subseteq A$, it holds that $E_\lambda(C \mid U \oplus A \oplus D) \ge E_\lambda(B \mid U \oplus A \oplus D) \ge E_\lambda(A \mid U \oplus A \oplus D) = 0$.

Proof: Given a fuzzy decision system $(U, A, D)$ and a granularity $\lambda$, for any $C \subseteq B \subseteq A$ it is easy to obtain $[x^i_{\lambda_i}]_C \ge [x^i_{\lambda_i}]_B \ge [x^i_{\lambda_i}]_A$ under the λ-granule. Thus, in each part $D_K$, $E_\lambda(C \mid D_K \oplus A) \ge E_\lambda(B \mid D_K \oplus A) \ge E_\lambda(A \mid D_K \oplus A)$ holds. Finally, $E_\lambda(C \mid U \oplus A \oplus D) \ge E_\lambda(B \mid U \oplus A \oplus D) \ge E_\lambda(A \mid U \oplus A \oplus D) = 0$ is true for any attribute sets satisfying $C \subseteq B \subseteq A$.

The monotonicity of the λ-PDE indicates that the smaller $E_\lambda(B \mid U \oplus A \oplus D)$ is, the more powerful the discriminating ability of $B$ for $D$ is. Only with this monotonicity can the algorithm obtain a proper result.
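Combining the pieces above, the following is a minimal sketch (not from the paper) of the λ-PDE of Eqs. (10) and (12). It assumes the `similarity_matrix` helper from the earlier snippet, takes the self-adaptive thresholds `lambdas` as given (each $\lambda_i = \underline{R_A}[x_i]_D(x_i)$, computable with the `lower_approximation` helper), assumes every $\lambda_i > 0$ (i.e., no two identical objects carry different labels), and uses the $|D_K|/|U|$-weighted sum per the reconstruction of Eq. (12):

```python
import numpy as np

def lambda_pde(R_A, R_B, lambdas):
    """Eq. (10) on one sub-system: ratio of lambda_i-granule cardinalities
    under A and B, where granule memberships follow Eq. (9)."""
    total = 0.0
    for i in range(len(lambdas)):
        size_A = np.sum(np.where(1.0 - R_A[i] >= lambdas[i], 0.0, lambdas[i]))
        size_B = np.sum(np.where(1.0 - R_B[i] >= lambdas[i], 0.0, lambdas[i]))
        total += np.log2(size_A / size_B)
    return -total / len(lambdas)

def decision_lambda_pde(subsystems):
    """Eq. (12): |D_K|/|U|-weighted sum of per-class lambda-PDEs.
    `subsystems` is a list of (R_A, R_B, lambdas) triples, one per D_K."""
    n_total = sum(len(lams) for _, _, lams in subsystems)
    return sum(len(lams) / n_total * lambda_pde(R_A, R_B, lams)
               for R_A, R_B, lams in subsystems)
```

Since $B \subseteq A$ implies $R_B \ge R_A$ and hence a larger granule under $B$, each log ratio is non-positive and the entropy is non-negative, consistent with the monotonicity property.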

C. λ-partition differentiation entropy feature selection

According to the above analysis, the algorithm first selects sub information systems from the raw decision information system. In this step, the raw information system $S = (U, A, D)$ is partitioned into $r$ sub-systems $(D_k, A)$, $k = 1, 2, \ldots, r$, for the subsequent steps. Feature selection is then performed under a specified threshold: the smallest value of the λ-PDE indicates the greatest significance. In particular, if a large number of features are present in the dataset, selecting a given number of attributes with comparatively small λ-PDE values can be computed in a short time. The whole algorithm can be divided into the following five steps (a sketch of the procedure is given after this list):
1) Standardisation of the original dataset. The attribute values of the instances are normalised first.
2) Partition of the dataset. For a decision information system, the original set is partitioned into sub-systems according to the label features.
3) Computing the similarities. The similarities are computed for the sub decision systems on both the whole attribute set $A$ and the sub attribute set $B$; obtaining the partial λ-PDEs of the attributes involves this similarity computation.
4) Finding the partial λ-PDEs. The partial λ-PDE is computed on each $D_k$ ($k = 1, \ldots, r$) with respect to the different candidate condition sets, which shows the difference in description power between $U/B$ and $U/A$.
5) Computing the whole λ-PDE. The λ-PDE in effect represents the degree of difference in knowledge granularity between the feature subset $B$ and the whole attribute set $A$.

On the basis of the previous information entropy or the positive region, the time complexities of fuzzy lower approximation-based feature selection (FRFS) [16], λ-condition entropy feature selection (LCE-FRFS) [15] and fuzzy entropy feature selection (FEFS) [22] are all $O(|U|^2)$, because the similarities of the examples are computed one by one. However, λ-PDE no longer needs to calculate the similarities between instances belonging to different decision classes. Assume that there are $r$ decision classes in a dataset and $|U|/r$ instances in every decision class on average; then, as expected, the time complexity can be reduced to $O(|U|^2/r)$. In particular, for datasets having a great number of decision labels, LDE-FRFS will perform more notably than the conventional feature selection methods.
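The five steps suggest a greedy forward search. The following is a hypothetical sketch (not the paper's exact Algorithm 1; the stopping threshold `epsilon` and the Łukasiewicz-based `adaptive_lambdas` are assumptions) that reuses `similarity_matrix` and `decision_lambda_pde` from the earlier snippets:

```python
import numpy as np

def adaptive_lambdas(data, labels, attrs):
    """Self-adaptive thresholds: lambda_i is the R_A-lower approximation
    of x_i's decision class at x_i. With the Lukasiewicz implicator this
    reduces to the minimum of 1 - R_A(x_i, y) over out-of-class y."""
    R = similarity_matrix(data, attrs)
    lams = np.empty(len(labels))
    for i in range(len(labels)):
        other = labels != labels[i]
        lams[i] = np.min(1.0 - R[i, other]) if other.any() else 1.0
    return lams

def lde_frfs(data, labels, epsilon=0.0):
    """Greedy forward selection driven by the decision lambda-PDE.
    `data` is [0, 1]-normalised (step 1); `labels` holds decision classes."""
    all_attrs = list(range(data.shape[1]))
    parts = [np.flatnonzero(labels == k) for k in np.unique(labels)]  # step 2
    lams = adaptive_lambdas(data, labels, all_attrs)
    R_As = [similarity_matrix(data[idx], all_attrs) for idx in parts]  # step 3
    reduct, remaining = [], set(all_attrs)
    while remaining:
        scores = {}
        for a in remaining:
            B = reduct + [a]
            subsystems = [(R_As[p], similarity_matrix(data[idx], B), lams[idx])
                          for p, idx in enumerate(parts)]
            scores[a] = decision_lambda_pde(subsystems)  # steps 4 and 5
        best = min(scores, key=scores.get)
        reduct.append(best)
        remaining.discard(best)
        if scores[best] <= epsilon:  # B describes D (almost) as well as A
            break
    return reduct
```

Note that the similarities under the full attribute set are computed once per class, and all candidate evaluations run within the class-induced sub-systems, which is where the $O(|U|^2/r)$ behaviour comes from.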

Algorithm 1 λ-PDE Fuzzy-Rough Feature Selection
Require: U: the universal set of examples; A: the conditional feature set; D: the decision feature set; F: a mapping from A × U to V; V: the attribute values;
Input: Original dataset S = (U, A, D);
Output: Reduct of conditional attributes R;
// Standardised original dataset Sd
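The pseudo-code is truncated here; its first line (producing the standardised dataset Sd, step 1 above) would correspond to a per-attribute normalisation. A minimal min-max sketch, an assumption since the paper does not specify the normalisation used:

```python
import numpy as np

def standardise(data):
    """Min-max normalise each attribute to [0, 1] (algorithm step 1)."""
    lo, hi = data.min(axis=0), data.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard constant attributes
    return (data - lo) / span
```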

attributes are taken into account in this experiment. A summary of the datasets is presented in Table I.

TABLE I
BENCHMARK DATA

Dataset
Arrhythmia
Cleveland
Ecoli
Heart
Liver
Olitos
Satellite
Sonar
Wisconsin
