2014 IEEE International Conference on Data Mining Workshop

Examination of Reliability of Missing Value Recovery in Data Mining

Shigang Liu
School of Information Technology, Deakin University, Melbourne, VIC 3125, Australia
[email protected]

Honghua Dai
School of Information Technology, Deakin University, Melbourne, VIC 3125, Australia
[email protected]

Abstract—Missing data imputation is an important task in cases where it is crucial to use all available data and not discard records with missing values. However, most existing algorithms focus on data that are missing at random (MAR) or missing completely at random (MCAR). In this paper, an information decomposition imputation (IDIM) algorithm using fuzzy membership functions is proposed to address missing values that are not missing at random (NMAR), and the reliability of missing value recovery under NMAR is examined. Firstly, this paper discusses the proposed IDIM algorithm with detailed examples. Then, the reliability of the proposed approach is evaluated with extensive experiments against several typical algorithms. The results demonstrate that the proposed algorithm achieves a higher accuracy rate than the existing imputation methods in terms of normalized root mean square error (NRMSE) and predictive accuracy on different sets of data with missing values, which shows that our method is more reliable in imputing missing values.

Keywords—Missing data imputation; not missing at random; information decomposition

I. INTRODUCTION

Data preparation is a key step in data mining applications; a minor data quality adjustment may bring higher effectiveness, which significantly increases the validity and quality of the discovered knowledge. It is said that data preparation takes approximately 80% of the total data engineering effort [1]. The situation is worse when there are missing data in the data set, especially when the data are not missing at random. This paper specifically studies incomplete data whose values are not missing at random. Importantly, the proposed algorithm can be used to maintain data quality in everyday applications. For example, in a questionnaire, low-income people are reluctant to report their salary; in this case the proposed algorithm can help researchers obtain a complete data set and thus better analysis results.

A. Related work

Missing value imputation is an actual yet challenging issue confronted in machine learning and data mining [2]. The authors of [3, 4] classified the methods for missing data recovery into three categories: 1) case deletion, 2) learning without handling of missing values, and 3) missing value imputation [4]. Case deletion simply omits the instances with missing attribute values and uses the remaining instances for learning [5, 6]. However, this method may lose important information, and several works have demonstrated the dangers of simply removing cases by case deletion [6, 7]. The second category learns without handling the missing values, for example learning Bayesian networks [8], artificial neural networks [9, 10], or inexact rules with the fish-net learning algorithm [11]. Missing data imputation is entirely different from the other two: the missing values are filled in before a learning application. A number of methods have been developed for the imputation of missing values over the last few decades; some of them are reviewed in [5] and compared by Judi Scheffer [12]. It is widely advocated that recovering a missing datum makes sense only if some complete instances exist in a small neighbourhood of the missing datum; otherwise, the imputed results are not meaningful. From another perspective, the imputation methods in common use include parametric methods such as linear regression [6, 9, 13] and nonparametric methods such as kernel imputation [4, 14] and the nearest neighbour (NN) method [15]. According to [4], parametric methods such as linear regression and Expectation-Maximization (EM), a classical parametric method [5], can lead to bias, because it is often impossible to have a good understanding of the distribution of the data set in a real application. In this case, a nonparametric imputation approach [14, 16] can achieve a better result by capturing the structure of the data set. Some researchers divide the imputation methods into statistical imputation and machine-learning imputation. Precisely, statistics-based models such as single imputation, hot-deck imputation, and multiple imputation (MI) [6] are widely used when the data set can be adequately modelled. For example, Pérez et al. [17] applied these methods, together with the mean approach, to impute missing data in the construction of a scoring system for predicting death in ICU patients. Imputation methods from the machine learning field are based on the construction of a predictive model that estimates absent values from the information available in the data set. Approaches such as the multi-layer perceptron (MLP), k-nearest neighbours (KNN), self-organising maps (SOM), and decision tree (DT) construction algorithms are commonly used. The authors of [18] compared some of them on a real breast cancer problem; the results showed that KNN gives better results in terms of prognostic accuracy (AUC). Besides, different algorithms are used in other areas. For example, the LSimpute method applies linear regression between the gene with the missing value and neighbouring genes [19]. Local least squares (LLS) imputation uses multiple regression of the gene with the missing value against the neighbouring genes [20].


Bayesian principal component analysis (BPCA) [21] has been shown to perform exceptionally well [22, 23]. However, in the case where all values are missing for a time point (an entire column), the BPCA algorithm does not work. Moreover, we note that all of the above imputation methods are based either on a missing-at-random mechanism (e.g., EM, MI, hot-deck, KNN) or on MCAR (e.g., listwise deletion, mean imputation).

B. Contribution

Our motivation is to propose an algorithm that performs well and more reliably, in terms of NRMSE and predictive accuracy, when a multivariable data set contains values that are not missing at random. Our main contributions are summarized as follows:
1. To the best of our knowledge, data not missing at random is a long-outstanding issue, especially for multivariable data; it has been noted that data not missing at random can be a strong and non-ignorable source of bias [24]. The problem thus poses a new technical challenge to traditional query methods. We propose an IDIM method, which achieves better results through iterative steps.
2. We introduce an efficient algorithm to impute missing values in multidimensional incomplete data. With the help of a fuzzy membership function, we can transfer the information carried in the observed data to the missing values. Moreover, the nearest neighbour approach helps the algorithm find more information when the fuzzy membership function cannot transfer information to the missing values; the two are complementary and speed up finding the missing data. Furthermore, the fuzzy membership function can be changed, which guides users to choose the fuzzy function according to the situation and their recovery quality preference.
3. Our algorithm is designed to make use of the original data set, which can be used to test the reliability of the iterative steps.
4. We conducted extensive experiments on data sets from UCI; the results demonstrate the effectiveness and reliability of our approach in terms of competitive accuracy.

The rest of the paper is organized as follows: the details of the proposed algorithm are presented in Section 2; Section 3 is devoted to experimental results and analyses; Section 4 concludes the paper.

II. PROBLEM DEFINITION AND IDIM ALGORITHM

In this section, we introduce and describe the methods applied to impute a multivariable incomplete data set. Before the new imputation algorithm is presented in subsection 2.3, earlier work on information decomposition imputation for one-dimensional data and on information distribution is recalled. Precisely, this section is organized as follows. First, the problem we are going to solve is defined in subsection 2.1. Then, some basic terminologies are given in subsection 2.2, and the proposed algorithm is described in subsection 2.3.

A. Problem definition

In this subsection, we formally define the query problem on a multivariable incomplete data set. Let $X$ be a real-valued $m\times n$ matrix data object, i.e., there are $m$ instances and $n$ attributes. Let $X=(x_1,x_2,\ldots,x_n)$, where $x_i$ $(1\le i\le n)$ is the attribute holding the data values of the $i$-th column of $X$.

A multivariable data object $X$ is lower incomplete if it satisfies: 1) at least one of its data elements is missing; 2) the missing data elements in each column are less than the maximum of that column's data elements. Take $x_i$ for example: let $m_{ik}$ $(1\le k<m)$ denote the missing values in $x_i$; we say $X$ is lower incomplete if $\max(m_{ik})<\max(x_i)$ for $1\le k<m$. This problem is common in medical, biological, and even daily-life data; for example, in a questionnaire, people with low incomes are less likely to report their family income than people with higher incomes. In this paper, we focus on this particular problem for continuous missing values.
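The condition above is easy to check programmatically. The following is a minimal sketch (ours, not from the paper; NumPy is assumed, and the true values behind the mask are only available in simulation settings) of a lower-incompleteness test:

import numpy as np

def is_lower_incomplete(X_true: np.ndarray, mask: np.ndarray) -> bool:
    """X_true: complete m x n data; mask: True where a value is treated as missing."""
    if not mask.any():
        return False                          # condition 1): at least one element must be missing
    for i in range(X_true.shape[1]):
        missing = X_true[mask[:, i], i]       # the hidden values m_ik of column x_i
        observed = X_true[~mask[:, i], i]
        if missing.size == 0:
            continue                          # this column is complete
        if observed.size == 0 or missing.max() >= observed.max():
            return False                      # condition 2): max(m_ik) < max(x_i) violated
    return True

# Example: the low value of the first column is the missing one.
X = np.array([[1.0, 7.0], [2.0, 8.0], [9.0, 9.0]])
mask = np.array([[True, False], [False, False], [False, False]])
print(is_lower_incomplete(X, mask))           # True, since 1.0 < 9.0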

B. Basic terminologies for the algorithm

In this subsection, we give a general review of fuzzy membership functions and information distribution.

1) Fuzzy membership functions: The membership function of a fuzzy set is a generalization of the indicator function of a classical set. In fuzzy logic, it represents the degree of truth as an extension of valuation. Membership functions were introduced by Zadeh in the first paper on fuzzy sets [25].

Definition (fuzzy membership function). For any set $X$, a membership function on $X$ is any function from $X$ to the real unit interval $[0,1]$. The membership function representing a fuzzy set $\tilde A$ is usually denoted by $\mu_A$. For an element $x$ of $X$, the value $\mu_A(x)$ is called the membership degree of $x$ in the fuzzy set $A$; it quantifies the grade of membership of the element $x$ in $A$ [25]. The value 0 means that $x$ is not a member of the fuzzy set; the value 1 means that $x$ is fully a member; values between 0 and 1 characterize fuzzy members, which belong to the fuzzy set only partially.

There are many kinds of membership functions, for example the triangular function, the trapezoidal function, the Gaussian function, and so on. The following are two of them.



Trapezoidal function:

$$\mu_A(x)=\begin{cases}0, & x\le a \ \text{or}\ x\ge d\\[2pt] \dfrac{x-a}{b-a}, & a\le x\le b\\[2pt] 1, & b\le x\le c\\[2pt] \dfrac{d-x}{d-c}, & c\le x\le d\end{cases} \qquad (1)$$

Triangular function, defined by a lower limit $a$, an upper limit $b$, and a value $m$, where $a<m<b$:

$$\mu_A(x)=\begin{cases}0, & x\le a\\[2pt] \dfrac{x-a}{m-a}, & a< x\le m\\[2pt] \dfrac{b-x}{b-m}, & m< x< b\\[2pt] 0, & x\ge b\end{cases} \qquad (2)$$

2) Information distribution: The author of [26] discussed information distribution for probability estimation and showed experimentally that information distribution performs better in obtaining a soft histogram than the classical histogram, reducing the error by about 23.2%. Moreover, the method of information distribution raises work efficiency, i.e., a small sample acts as a larger one for estimation; in other words, information distribution makes good use of the data set's distribution.

Definition (1-dimensional linear information distribution [26]). Let $X=\{x_i \mid i=1,2,\ldots,n\}$ be a given sample, $R$ the universe of discourse of $X$, and $U=\{u_1,u_2,\ldots,u_m\}$ the discrete universe of $X$, where $u_j-u_{j-1}\equiv h$, $j=2,3,\ldots,m$. For $x_i\in X$ and $u_j\in U$, the following formula is called the 1-dimensional linear information distribution:

$$\mu(x_i,u_j)=\begin{cases}1-\lvert x_i-u_j\rvert/h, & \lvert x_i-u_j\rvert\le h\\ 0, & \lvert x_i-u_j\rvert> h\end{cases} \qquad (3)$$

where $h$ is called the step length and $\mu$ is called the linear distribution. Obviously, $\mu$ satisfies all the properties of an information distribution function; in particular it is conserved, that is,

$$\sum_{j=1}^{m}\mu(x_i,u_j)=1, \qquad i=1,2,\ldots,n.$$
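The linear distribution of Eq. (3) and its conserved property can be checked numerically; the following sketch (ours; the grid values are arbitrary) distributes one observation over an equally spaced universe and verifies that the weights sum to 1:

import numpy as np

def linear_distribution(x, u, h):
    """Eq. (3): 1 - |x - u|/h when |x - u| <= h, else 0."""
    d = abs(x - u)
    return 1.0 - d / h if d <= h else 0.0

h = 1.6
centers = np.arange(3.4, 8.3, h)              # equally spaced discrete universe: 3.4, 5.0, 6.6, 8.2
x = 5.8                                       # an observation between two centers
weights = [linear_distribution(x, u, h) for u in centers]
print(weights, sum(weights))                  # the two nonzero weights (0.5, 0.5) sum to 1.0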

C. The proposed IDIM algorithm

1) Information decomposition: In order to make our algorithm easy to understand, we first take one column $x_i=(x_{i1},x_{i2},\ldots,x_{im})'$, $i=1,2,\ldots,n$, of $X=(x_1,x_2,\ldots,x_n)$ as an example. Let $R$ be the universe of discourse of $x_i$ and $A=[a,b]$, where $a=\min\{x_{ij}\mid j=1,2,\ldots,m\}$ and $b=\max\{x_{ij}\mid j=1,2,\ldots,m\}$. Let $t$ be the settled number of intervals into which $A=[a,b]$ is divided; usually $t$ is the number of missing values. The step length is then $h=(b-a)/t$, with $A=\bigcup_{j=1}^{t}A_j$ and $A_j=[a+(j-1)h,\,a+jh]$. $U=\{u_1,u_2,\ldots,u_t\}$ is the discrete universe of $R$, where $u_j-u_{j-1}\equiv h$, $j=2,3,\ldots,t$, and $u_j=(a+(j-1)h+a+jh)/2$, that is, $u_j$ is the center of $A_j$. With $\mu(x_i,u_j)$ obtained from formula (3), for $x_i\in X$ and $u_j\in U$, the quantity $m_{ij}$ obtained from formula (5) is called the 1-dimensional linear information decomposition from $x_i$ to $A_j$:

$$m_{ij}=\mu(x_i,u_j)\cdot x_i \qquad (5)$$

Remark: the map $\mu$, the information distribution, is chosen from among the fuzzy membership functions.


More precisely, for each column let $x_i=(x_{i1},x_{i2},\ldots,x_{im})'$, $i=1,2,\ldots,n$, be a given sample, $R$ the universe of discourse of $x_i$, and $A_i=[a_i,b_i]$, where $a_i=\min\{x_{ij}\mid j=1,2,\ldots,m\}$ and $b_i=\max\{x_{ij}\mid j=1,2,\ldots,m\}$. Let $t_i$ be the settled number of intervals into which $A_i=[a_i,b_i]$ is divided; usually $t_i$ is the number of missing values in $x_i$. The step length is $h_i=(b_i-a_i)/t_i$, with $A_i=\bigcup_{s=1}^{t_i}A_{is}$ and $A_{is}=[a_i+(s-1)h_i,\,a_i+s\,h_i]$. $U_i=\{u_{i1},u_{i2},\ldots,u_{it}\}$ is the discrete universe of $R$, where $u_{is}-u_{i,s-1}\equiv h_i$, $s=2,3,\ldots,t_i$, and $u_{is}=(a_i+(s-1)h_i+a_i+s\,h_i)/2$, that is, $u_{is}$ is the center of $A_{is}$. With $\mu(x_{ij},u_{is})$ obtained from formula (3), for $x_{ij}\in x_i$ and $u_{is}\in U_i$, the quantity $m_{ijs}$ obtained from formula (6) is called the 1-dimensional linear information decomposition from $x_{ij}$ to $A_{is}$:

$$m_{ijs}=\mu(x_{ij},u_{is})\cdot x_{ij} \qquad (6)$$

For example, let $x_i=(3.4,\,5,\,8.2,\,6)'$ with $t_i=3$ (three missing values). Then $x_{i1}=3.4$, $x_{i2}=5$, $x_{i3}=8.2$, $x_{i4}=6$, $a_i=3.4$, $b_i=8.2$, $A_i=[3.4,8.2]$, $h_i=1.6$, $u_{i1}=4.2$, $u_{i2}=5.8$, $u_{i3}=7.4$, and $A_{i1}=[3.4,5)$, $A_{i2}=[5,6.6)$, $A_{i3}=[6.6,8.2)$. The 1-dimensional linear information decompositions are therefore:

$m_{i11}=\mu(x_{i1},u_{i1})\cdot x_{i1}=(1-\lvert 3.4-4.2\rvert/1.6)\cdot 3.4=1.7$,
$m_{i12}=\mu(x_{i1},u_{i2})\cdot x_{i1}=0$, $\quad m_{i13}=\mu(x_{i1},u_{i3})\cdot x_{i1}=0$,
$m_{i21}=\mu(x_{i2},u_{i1})\cdot x_{i2}=(1-\lvert 5-4.2\rvert/1.6)\cdot 5=2.5$,
$m_{i22}=\mu(x_{i2},u_{i2})\cdot x_{i2}=(1-\lvert 5-5.8\rvert/1.6)\cdot 5=2.5$,
$m_{i23}=\mu(x_{i2},u_{i3})\cdot x_{i2}=0$,
$m_{i31}=\mu(x_{i3},u_{i1})\cdot x_{i3}=0$, $\quad m_{i32}=\mu(x_{i3},u_{i2})\cdot x_{i3}=0$,
$m_{i33}=\mu(x_{i3},u_{i3})\cdot x_{i3}=(1-\lvert 8.2-7.4\rvert/1.6)\cdot 8.2=4.1$,
$m_{i41}=\mu(x_{i4},u_{i1})\cdot x_{i4}=0$,
$m_{i42}=\mu(x_{i4},u_{i2})\cdot x_{i4}=(1-\lvert 6-5.8\rvert/1.6)\cdot 6=5.25$,
$m_{i43}=\mu(x_{i4},u_{i3})\cdot x_{i4}=(1-\lvert 6-7.4\rvert/1.6)\cdot 6=0.75$.

Because $x_{i1}\in A_{i1}$, $x_{i2},x_{i4}\in A_{i2}$, and $x_{i3}\in A_{i3}$, the recovered missing values are:

$$\hat m_{i1}=\frac{m_{i21}+m_{i11}}{2}=2.1,\qquad \hat m_{i2}=\frac{m_{i22}+m_{i42}}{2}=3.875,\qquad \hat m_{i3}=m_{i43}=0.75.$$

Remark: if we choose a different linear distribution $\mu$, different values will be obtained.
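The worked example can be reproduced mechanically. The sketch below (ours, using NumPy) rebuilds the decomposition table of Eq. (6) for the column (3.4, 5, 8.2, 6) and then averages the nonzero entries per center; this reproduces 2.1 and 3.875 for the first two recovered values, while for the third center the example above keeps only m_i43 = 0.75 rather than the plain nonzero average:

import numpy as np

observed = np.array([3.4, 5.0, 8.2, 6.0])
t = 3                                          # number of missing values
a, b = observed.min(), observed.max()
h = (b - a) / t                                # step length: 1.6
centers = a + (np.arange(t) + 0.5) * h         # interval centers: 4.2, 5.8, 7.4

def mu(x, u):                                  # Eq. (3), the linear distribution
    d = abs(x - u)
    return 1.0 - d / h if d <= h else 0.0

# m[j, s] = mu(x_ij, u_is) * x_ij, as in Eq. (6)
m = np.array([[mu(x, u) * x for u in centers] for x in observed])
print(np.round(m, 3))                          # rows: [1.7 0 0], [2.5 2.5 0], [0 0 4.1], [0 5.25 0.75]

for s in range(t):                             # average the nonzero decompositions at each center
    col = m[:, s][m[:, s] > 0]
    print(f"center {centers[s]:.1f}: {col.mean():.3f}")   # 2.100, 3.875, 2.425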

2) Information decomposition imputation algorithm: In the following, we describe in detail how the information decomposition method is used in missing data recovery. First of all, we note that this paper focuses on continuous variables in a lower incomplete data set. Let $X=(x_{ij})_{m\times n}$ be a data set that is incomplete in its columns; equivalently, we write $X=(x_1,x_2,\ldots,x_n)$ with $x_i=(x_{i1},x_{i2},\ldots,x_{im})'$, $i=1,2,\ldots,n$.

Suppose the column $x_i=\{x_{ij}\mid j=1,2,\ldots,m\}$ has missing values; the number of missing values is $t_i$, and the missing values are denoted $\{m_{is}\mid s=1,2,\ldots,t_i\}$. Let $a_i=\min\{x_{ij}\mid j=1,2,\ldots,m\}$ and $b_i=\max\{x_{ij}\mid j=1,2,\ldots,m\}$; this yields the interval $[a_i,b_i]$. Let $h_i=(b_i-a_i)/t_i$ and $A_{is}=[a_i+(s-1)h_i,\,a_i+s\,h_i)$, $s=1,2,\ldots,t_i$, with centers $u_{is}=(a_i+(s-1)h_i+a_i+s\,h_i)/2$. Then we find the $x_{ij}\in A_{is}$ $(j=1,2,\ldots,m)$ and obtain $\{x_{il}\mid l=1,2,\ldots,q_i,\ q_i<m\}=A_{is}\cap X$. To clarify, we denote $Y_{iq_i}=\{y_l\mid l=1,2,\ldots,q_i\}=A_{is}\cap X$ and compute

$$\bar y_i=\frac{\sum_{l=1}^{q_i}y_l}{q_i}, \qquad i=1,2,\ldots,t.$$

Of course, we can use the linear distribution of formula (3); here we choose another one, derived from the trapezoidal function:


( xij < ais ) or ( xij > bis ) ­ 0 °x −a ° ij ais ≤ xij ≤ ais + his °° his (7) μ ( xij , uis ) = ® 1 ais + his ≤ xij ≤ ais + 2his ° ° d − xij ° ais + 2his ≤ xij ≤ bis °¯ his b − ais Where his = is , ais = ai + ( s − 1) * hi , bis = ai + s * hi 5 Then we can calculate mijs ,finally we get the i th missing

INPUT: the incomplete data X OUTPUT: the imputed complete data X p % The first imputation for i = 1: n each xi ∈ X , t

Ai = * Ais ; U i = {ui1 , ui 2 ," , uit } ; Ais = [ai + ( s − 1) ∗ h, ai + s ∗ h), s = 1,2," ,ti ; s =1

a + ( s − 1) ∗ h + a + s ∗ h ; mijs = μ ( xij , uis ) ∗ xij ; 2 if all mijs = 0 uis =

if nearest neighbour doesn’t work, use the mean of the column to impute

data value, which is m is . If all mijs are 0, we use the nearest neighbour method to find out one values that replace the missing value, in case the nearest neighbour approach does not work, we will take the mean of the column as the missing value. For not all of the mijs zero, the averaged values will be

end else mis = averaged nonzero mijs . end end X 1 : imputed complete data set; X r : the imputed values.
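Transcribed as printed, the weight of Eq. (7) for one subinterval reads as follows (a sketch, ours; the branch boundaries are taken verbatim from the formula above):

def mu_trapezoid(x, a_is, b_is):
    """Eq. (7), as printed, for one subinterval A_is = [a_is, b_is) with h_is = (b_is - a_is)/5."""
    h_is = (b_is - a_is) / 5.0
    if x < a_is or x > b_is:
        return 0.0
    if x <= a_is + h_is:
        return (x - a_is) / h_is              # rising edge
    if x <= a_is + 2 * h_is:
        return 1.0                            # plateau
    return (b_is - x) / h_is                  # falling edge, exactly as printed in Eq. (7)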

The following steps are used to recover the missing values of a data set with missing values:
1. Given the incomplete data set.
2. Compute $A_i$ and $U_i$.
3. Compute $Y_{iq_i}=A_{is}\cap X$, for $i=1,2,\ldots,n$.
4. For each $i$, compute $\mu_{A_{is}}(y_l,u_{is})$, $l=1,2,\ldots,q_i$ and $s=1,2,\ldots,t_i$.
5. Compute the $m_{ijs}$. If all $m_{ijs}=0$, use the nearest neighbour method to find a value to replace the missing value; if the nearest neighbour approach does not work, take the mean of the column as the missing value. Otherwise, the averaged value is regarded as the missing value.
6. Utilize the obtained values as $X$ and repeat steps 2 to 5 until the error satisfies the settled value.

Given the above steps, one can see that the proposed imputation method is not a single-imputation method. Single imputation is not feasible when there are only a few complete instances: once the proposed method cannot recover the missing values it falls back on the nearest neighbour method, and when that also fails, the mean of the column is taken to replace the missing values, which causes bias; this becomes more serious as the missing rate grows. In fact, high missing rates do exist in practice, especially in industrial data sets. Take the data set in [27] for example: it contains 4383 missing values, and none of the records is complete. It is therefore important to consider how to utilize all the variable information, and especially how the further use of the recovered values can assist in improving the imputation performance. For this reason the proposed algorithm is designed as a nonparametric iterative imputation approach, in order to achieve better interpolation and extrapolation. The algorithm is given in Fig. 1.

INPUT: the incomplete data X
OUTPUT: the imputed complete data X_p

% The first imputation
for i = 1:n, each x_i in X
    A_i = union of A_is, s = 1, ..., t_i;  U_i = {u_i1, ..., u_it};
    A_is = [a_i + (s-1)*h_i, a_i + s*h_i), s = 1, ..., t_i;
    u_is = (a_i + (s-1)*h_i + a_i + s*h_i)/2;
    m_ijs = mu(x_ij, u_is) * x_ij;
    if all m_ijs = 0
        use the nearest neighbour to impute m_is;
        if the nearest neighbour does not work, use the mean of the column to impute m_is;
    else
        m_is = average of the nonzero m_ijs;
    end
end
X_1: imputed complete data set; X_r: the imputed values.

% The t-th (t > 1) imputation
p = 0; X = X_r;
while p < settled value or NR_p - NR_{p-1} > NR_{p-1} - NR_{p-2}
    repeat steps 1-5;
    p = p + 1; X = X_r;
end

Figure 1. Pseudocode of the proposed algorithm.

Remarks: NR_p is the normalized root mean squared error (NRMSE) between X_p and X_{p-1}, and the settled value is the number of iterations, which is usually no more than 6.
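One possible Python rendering of the control flow in Fig. 1 is sketched below (ours; `impute_once` and `settled_value` are placeholder names, and the early-exit test is one reading of the while-condition above):

import numpy as np

def nrmse_between(A, B):
    return float(np.sqrt(((A - B) ** 2).sum() / (B ** 2).sum()))

def iterative_imputation(X, impute_once, settled_value=6):
    """impute_once(X) -> a fully filled copy of X (missing entries re-estimated)."""
    X_prev = impute_once(X)                   # the first imputation (top of Fig. 1)
    diffs = []                                # NR_1, NR_2, ... between successive fills
    for p in range(settled_value):
        X_next = impute_once(X_prev)
        diffs.append(nrmse_between(X_next, X_prev))
        X_prev = X_next
        # stop once the change stops shrinking: NR_p - NR_{p-1} <= NR_{p-1} - NR_{p-2}
        if len(diffs) >= 3 and diffs[-1] - diffs[-2] <= diffs[-2] - diffs[-3]:
            break
    return X_prev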

III. SIMULATION STUDY

In this section, we present experimental results evaluating the performance of the proposed algorithm and compare our approach with other popular methods.


A. Data sets

In this paper, we carry out 24 experiments based on two data sets taken from the UCI repository [28]: the Abalone data and the Boston Housing data (see Table I). Neither data set has missing values, so we can compare the imputed data sets with the original ones.

TABLE I. DATA SETS FROM UCI
Name     | Attr. type | #(attr.) | #(ins.)
Abalone  | 7/1/0      | 8        | 4177
Housing  | 12/1/1     | 14       | 506
The columns give the name of the database, the types of the independent attributes (continuous/unordered/ordered), the number of independent attributes, and the number of instances, respectively.

During the experiments, the first 6 columns of each data set, which contain continuous variables, are used. Each data set is given two different kinds of missing patterns. Take the Abalone data set as an example: let $X=(x_1,x_2,\ldots,x_8)$ be the 'complete' matrix of the original data set, with attributes $x_i=(x_{i1},x_{i2},\ldots,x_{i,4177})'$, $i=1,2,\ldots,6$.


Denote $A_i=\max(x_i)$, $B_i=\min(x_i)$, and $t_i=(A_i-B_i)/5$. Then the data in each column that are less than $B_i+t_i$ (respectively $B_i+2t_i$) are deleted. Both the Abalone and the Housing data sets use the same missing strategy. Table II details the meaning of the strategy numbers used in the experiments:

TABLE II. MISSING VALUE STRATEGIES
SN | Missing pattern
1  | The first continuous variable column has MV
2  | The first 2 continuous variable columns have MV
3  | The first 3 continuous variable columns have MV
4  | The first 4 continuous variable columns have MV
5  | The first 5 continuous variable columns have MV
6  | The first 6 continuous variable columns have MV
MV: missing values; SN: strategy number.
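A sketch of this deletion scheme (ours; NumPy assumed, with NaN marking a deleted entry):

import numpy as np

def make_lower_nmar(X: np.ndarray, k: int, factor: int = 1) -> np.ndarray:
    """Delete the values below B_i + factor*t_i in columns 0..k-1 (factor = 1 or 2)."""
    X = X.astype(float).copy()
    for i in range(k):
        B_i, A_i = X[:, i].min(), X[:, i].max()   # B_i = min, A_i = max of the column
        t_i = (A_i - B_i) / 5.0
        X[X[:, i] < B_i + factor * t_i, i] = np.nan
    return X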

B. Methods compared and parameter settings

Six other methods mentioned before were used in the simulation study: Regularized Expectation-Maximization (REM) [29], the mixture-of-kernels imputation method [4], KNNimpute [30], LLSimpute [20], BPCAimpute [21], and mean imputation. We adopt the KNNimpute algorithm available in the Bioinformatics Toolbox of Matlab 7; the codes of all the other algorithms are available for download from the respective researchers' websites. The results reported below were obtained on a Windows 8 laptop equipped with a Core i7-2600 CPU at 3.40 GHz and 8.00 GB of RAM; Matlab 7.0 and Weka 3.6 were used for the evaluation. As for parameter values, for KNN we chose $k=6$ after comparing the results (it achieves a higher accuracy), and for the mixture-of-kernels imputation we chose $\sigma=0.2$, $q=2$, $\rho=0.95$ [4].
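As an illustration only, an analogous KNN setup in Python's scikit-learn (this is an assumption on our part, not the Matlab KNNimpute code used in the paper) with the same k = 6 would look like:

import numpy as np
from sklearn.impute import KNNImputer

X_missing = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0],
                      [2.0, np.nan], [3.0, 4.0], [5.0, 6.0], [6.0, 7.0]])
imputer = KNNImputer(n_neighbors=6)           # k = 6, as selected in the paper
X_filled = imputer.fit_transform(X_missing)
print(X_filled)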

C. Performance measure

To measure the performance of the methods, and to show the reliability of the proposed algorithm, we employ two measures: NRMSE and predictive accuracy. Precisely, we use the normalized root mean square error (NRMSE) as the performance measurement for our method, calculated as

$$\mathrm{NRMSE}=\sqrt{\frac{\sum_{i=1}^{m}\sum_{j=1}^{n}\left[X(i,j)-\hat X(i,j)\right]^2}{\sum_{i=1}^{m}\sum_{j=1}^{n}\left[X(i,j)\right]^2}}$$

where $X$ is the true value, $\hat X$ is the estimated value, and $m$ and $n$ are the total numbers of rows (instances) and columns (attributes), respectively. We then performed numerical simulations on the data imputed by each method; J48 is used to measure the reliability of each method. For each run of J48 and each set of parameters, a 9-fold complete cross-validation scheme was used.
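A direct transcription of this formula (ours; NumPy arrays assumed, with X_true holding the original values and X_hat the imputed ones):

import numpy as np

def nrmse(X_true: np.ndarray, X_hat: np.ndarray) -> float:
    num = ((X_true - X_hat) ** 2).sum()       # squared error over all cells
    den = (X_true ** 2).sum()                 # normalization by the true values
    return float(np.sqrt(num / den))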

D. Simulation of missing variables and reliability analysis

In this subsection, we describe the two kinds of evaluation results respectively.

1) NRMSE evaluation: By imputing the missing values under the different missing patterns, we compared the performance of the seven methods in terms of NRMSE. The results in Tables III-VI show the performance of the existing algorithms and of our proposed method. Our method has lower NRMSE values in all of the situations tested, except against the REM method on Abalone ($B_i+t_i$) at strategy 2 and the LLS approach on Abalone ($B_i+2t_i$) at strategies 1 and 2. This is because in those cases only one attribute has very few missing values, and the missing values are similar to one another, some even identical. In the cases where a few columns have a large number of missing values, the missing values are not imputed at all by BPCA and LLS (Housing ($B_i+t_i$)). And when one of the columns contains nearly 100% missing values, we can see from Abalone ($B_i+t_i$) that REM does not function and the mixture-of-kernels imputation performs even worse than mean imputation on this defined problem. Comparatively, our proposed algorithm works well in all of these situations, with low NRMSE, which means its results are more reliable.

TABLE III. NRMSE RESULTS FOR ABALONE (B_i + t_i)
SN | REM    | Kernel | KNN    | LLS    | BPCA   | Mean   | IDIM
1  | 0.0091 | 0.0348 | 0.0120 | 0.0037 | 0.0051 | 0.0338 | 0.0321
2  | 0.0192 | 0.0458 | 0.0204 | 0.0756 | ---    | 0.0447 | 0.0598
3  | ---    | 0.1224 | 0.1782 | 0.2484 | ---    | 0.1219 | 0.1817
4  | ---    | 0.4194 | 0.1926 | 0.4903 | ---    | 0.3758 | 0.2671
5  | ---    | 0.4602 | 0.1981 | 0.0663 | ---    | 0.4180 | 0.2890
6  | ---    | 0.4693 | 0.1898 | 0.1904 | ---    | 0.4279 | 0.2966

TABLE IV. NRMSE RESULTS FOR ABALONE (B_i + 2t_i)
SN | REM    | Kernel | KNN    | LLS    | BPCA   | Mean   | IDIM
1  | 0.0226 | 0.0724 | 0.0490 | 0.0056 | 0.0268 | 0.0709 | 0.0086
2  | 0.0412 | 0.0955 | 0.0857 | 0.0756 | ---    | 0.0938 | 0.0093
3  | ---    | 0.5315 | 0.2398 | 0.2485 | ---    | 0.5311 | 0.0407
4  | ---    | 0.8141 | 0.3156 | 0.4903 | ---    | 0.8029 | 0.1099
5  | ---    | 0.8806 | 0.3600 | 0.0663 | ---    | 0.8696 | 0.1344
6  | ---    | 0.8965 | 0.3744 | 0.1904 | ---    | 0.8856 | 0.1276

TABLE V. NRMSE RESULTS FOR HOUSING (B_i + t_i)
SN | REM    | Kernel | KNN    | LLS    | BPCA   | Mean   | IDIM
1  | 0.0296 | 0.0393 | 0.0178 | 0.0997 | 0.0279 | 0.0296 | 0.0177
2  | 0.0293 | 0.0402 | 0.0186 | 0.1014 | 0.0280 | 0.0302 | 0.0179
3  | 0.0295 | 0.0403 | 0.0247 | ---    | ---    | 0.0301 | 0.0179
4  | 0.0295 | 0.0403 | 0.0248 | ---    | ---    | 0.0302 | 0.0179
5  | 0.0320 | 0.0460 | 0.0248 | ---    | ---    | 0.0351 | 0.0181
6  | 0.0319 | 0.0462 | 0.0295 | ---    | ---    | 0.0352 | 0.0234

TABLE VI. NRMSE RESULTS FOR HOUSING (B_i + 2t_i)
SN | REM    | Kernel | KNN    | LLS    | BPCA   | Mean   | IDIM
1  | 0.0534 | 0.0556 | 0.0186 | 0.1107 | 0.0528 | 0.0534 | 0.0124
2  | 0.0534 | 0.0572 | 0.0220 | 0.1210 | 0.0531 | 0.0543 | 0.0127
3  | 0.0539 | 0.0572 | 0.0278 | 0.1353 | 0.0531 | 0.0543 | 0.0203
4  | 0.0540 | 0.0572 | 0.0267 | 0.1345 | 0.0532 | 0.0543 | 0.0203
5  | 0.0567 | 0.0676 | 0.0275 | 0.1348 | 0.0559 | 0.0622 | 0.0208
6  | 0.0578 | 0.0678 | ---    | 0.2351 | 0.0568 | 0.0623 | 0.0356

2) Predictive accuracy evaluation: The UCI Abalone data set is used for the classification accuracy evaluation; strategies 1, 3, and 5 are chosen for this measurement.

TABLE VII. PREDICTIVE ACCURACY FOR ABALONE (B_i + t_i)
SN | REM    | Kernel | KNN    | LLS    | BPCA   | Mean   | IDIM
1  | 52.83% | 52.54% | 52.78% | 53.07% | 52.75% | 52.37% | 53.02%
3  | ---    | 53.33% | 53.66% | 53.28% | ---    | 53.28% | 53.94%
5  | ---    | 53.30% | 53.11% | 52.13% | ---    | 53.21% | 53.54%

TABLE VIII. PREDICTIVE ACCURACY FOR ABALONE (B_i + 2t_i)
SN | REM    | Kernel | KNN    | LLS    | BPCA   | Mean   | IDIM
1  | 52.51% | 52.25% | 52.51% | 52.49% | 52.39% | 52.25% | 52.50%
3  | ---    | 52.25% | 53.40% | 53.28% | ---    | 53.21% | 54.05%
5  | ---    | 52.59% | 53.13% | 52.13% | ---    | 52.56% | 53.23%
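A rough Python analogue of this evaluation protocol (ours, and an approximation: scikit-learn's CART decision tree stands in for Weka's J48, and cv=9 for the 9-fold scheme):

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def imputed_accuracy(X_imputed: np.ndarray, y: np.ndarray) -> float:
    clf = DecisionTreeClassifier(random_state=0)   # CART, standing in for J48
    return float(cross_val_score(clf, X_imputed, y, cv=9).mean())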

The predictive accuracy on the original data set, i.e., the Abalone data set without missing values, is 59.97%. We can therefore see that our proposed algorithm is more reliable than the other six methods. Precisely, at strategy 1 the accuracies of all methods are already close to each other; as noted before, this is because only one attribute has very few missing values, and some of the missing values are very close or even identical. Interestingly, when there are more missing data in the multidimensional data set, the proposed algorithm outperforms the others, as can be seen from strategies 3 and 5 in Tables VII and VIII. The aforementioned simulation results show that the proposed algorithm achieves significantly lower error than Regularized Expectation-Maximization, the mixture-of-kernels imputation method, KNNimpute, LLSimpute, and BPCAimpute on the defined problem, which illustrates that our proposed algorithm is more reliable in terms of NRMSE. There are also limitations to our approach. Firstly, there is no single best linear function for all kinds of data sets. Secondly, the algorithm is designed from a mathematical, or rather fuzzy-mathematical, perspective, so users need such a background to make good use of it. Thirdly, it currently cannot impute discrete data. Further research is needed to generalize it to both continuous and discrete data sets.

IV. CONCLUSION

This paper introduced an IDIM algorithm aimed at solving the defined lower incomplete problem. Generally speaking, there are two steps in our algorithm: the first is to choose the linear function, which usually comes from the fuzzy membership functions; the second is to impute the missing values using the proposed algorithm together with the nearest neighbour method. We conducted an extensive experimental evaluation using data sets from UCI. The results indicate that our approach achieves satisfactory performance on the lower incomplete problem compared with the other six methods, and experiments based on NRMSE and predictive accuracy showed that the proposed method is more reliable. Our proposed algorithm may be useful for missing value estimation, especially when a column contains many missing values that are smaller than the largest value in that column.

REFERENCES

[1] Y. Qin, S. Zhang, X. Zhu, J. Zhang, and C. Zhang, "POP algorithm: Kernel-based imputation to treat missing values in knowledge discovery from databases," Expert Systems with Applications, vol. 36, pp. 2794-2804, 2009.
[2] S. Zhang, Z. Qin, C. X. Ling, and S. Sheng, "'Missing is useful': Missing values in cost-sensitive decision trees," IEEE Transactions on Knowledge and Data Engineering, vol. 17, pp. 1689-1693, 2005.
[3] Y. Qin, S. Zhang, X. Zhu, J. Zhang, and C. Zhang, "Semi-parametric optimization for missing data imputation," Applied Intelligence, vol. 27, pp. 79-88, 2007.
[4] X. Zhu, S. Zhang, Z. Jin, Z. Zhang, and Z. Xu, "Missing value estimation for mixed-attribute data sets," IEEE Transactions on Knowledge and Data Engineering, vol. 23, pp. 110-121, 2011.
[5] P. D. Allison, Missing Data. Thousand Oaks, CA: Sage, 2000.
[6] R. J. Little and D. B. Rubin, Statistical Analysis with Missing Data, 2002.
[7] D. B. Rubin, Multiple Imputation for Nonresponse in Surveys. New York: John Wiley & Sons, 1987.
[8] M. Ramoni and P. Sebastiani, "Learning Bayesian networks from incomplete databases," in Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence, 1997, pp. 401-408.
[9] Z. Ghahramani and M. I. Jordan, "Mixture models for learning from incomplete data," Computational Learning Theory and Natural Learning Systems, vol. 4, pp. 67-85, 1997.
[10] U. Dick, P. Haider, and T. Scheffer, "Learning from incomplete data with infinite imputations," in Proceedings of the 25th International Conference on Machine Learning, 2008, pp. 232-239.
[11] H. Dai and V. Ciesielski, "Learning of inexact rules by the fish-net algorithm from low quality data," in Proceedings of the Eighth Australian Joint Artificial Intelligence Conference, 1994.
[12] J. Scheffer, "Dealing with missing data," 2002.
[13] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, Series B (Methodological), pp. 1-38, 1977.
[14] S. Zhang, "Parimputation: From imputation and null-imputation to partially imputation," IEEE Intelligent Informatics Bulletin, vol. 9, pp. 32-38, 2008.
[15] C. Zhang, X. Zhu, J. Zhang, Y. Qin, and S. Zhang, "GBKII: An imputation method for missing values," in Advances in Knowledge Discovery and Data Mining, Springer, 2007, pp. 1080-1087.
[16] Q. Wang and J. Rao, "Empirical likelihood-based inference under imputation for missing response data," The Annals of Statistics, vol. 30, pp. 896-924, 2002.
[17] A. Pérez, R. J. Dennis, J. F. Gil, M. A. Rondón, and A. López, "Use of the mean, hot deck and multiple imputation techniques to predict outcome in intensive care unit patients in Colombia," Statistics in Medicine, vol. 21, pp. 3885-3896, 2002.
[18] J. M. Jerez, I. Molina, P. J. García-Laencina, E. Alba, N. Ribelles, M. Martín, and L. Franco, "Missing data imputation using statistical and machine learning methods in a real breast cancer problem," Artificial Intelligence in Medicine, vol. 50, pp. 105-115, 2010.
[19] T. H. Bø, B. Dysvik, and I. Jonassen, "LSimpute: Accurate estimation of missing values in microarray data with least squares methods," Nucleic Acids Research, vol. 32, p. e34, 2004.
[20] H. Kim, G. H. Golub, and H. Park, "Missing value estimation for DNA microarray gene expression data: Local least squares imputation," Bioinformatics, vol. 21, pp. 187-198, 2005.
[21] S. Oba, M. Sato, I. Takemasa, M. Monden, K. Matsubara, and S. Ishii, "A Bayesian missing value estimation method for gene expression profile data," Bioinformatics, vol. 19, pp. 2088-2096, 2003.
[22] X. Wang, A. Li, Z. Jiang, and H. Feng, "Missing value estimation for DNA microarray gene expression data by Support Vector Regression imputation and orthogonal coding scheme," BMC Bioinformatics, vol. 7, p. 32, 2006.
[23] D. S. Wong, F. K. Wong, and G. R. Wood, "A multi-stage approach to clustering and imputation of gene expression profiles," Bioinformatics, vol. 23, pp. 998-1005, 2007.
[24] D. B. Rubin, "Inference and missing data," Biometrika, vol. 63, pp. 581-592, 1976.
[25] L. A. Zadeh, "Fuzzy sets," Information and Control, vol. 8, pp. 338-353, 1965.
[26] H. Chongfu, "Demonstration of benefit of information distribution for probability estimation," Signal Processing, vol. 80, pp. 1037-1048, 2000.
[27] K. Lakshminarayan, S. A. Harp, and T. Samad, "Imputation of missing data in industrial databases," Applied Intelligence, vol. 11, pp. 259-275, 1999.
[28] C. J. Merz and P. M. Murphy, "UCI repository of machine learning databases," 1998.
[29] T. Schneider, "Analysis of incomplete climate data: Estimation of mean values and covariance matrices and imputation of missing values," Journal of Climate, vol. 14, pp. 853-871, 2001.
[30] O. Troyanskaya, M. Cantor, G. Sherlock, P. Brown, T. Hastie, R. Tibshirani, D. Botstein, and R. B. Altman, "Missing value estimation methods for DNA microarrays," Bioinformatics, vol. 17, pp. 520-525, 2001.