IMPUTATION OF MISSING DATA USING BAYESIAN PRINCIPAL COMPONENT ANALYSIS ON TEC IONOSPHERIC SATELLITE DATASET

Dr. P. Subashini
Ms. M. Krishnaveni

Department of Computer Science, Avinashilingam University for Women, Coimbatore, Tamil Nadu, India

ABSTRACT

The ionosphere is the region of the earth's upper atmosphere where sufficient ionisation exists to affect the propagation of radio waves. Estimating missing values of ionospheric total electron content (TEC) is crucial and remains a challenge for GPS positioning and navigation systems, space weather forecasting, and many other Earth observation systems. There are a number of ways of dealing with missing data; this work applies Bayesian Principal Component Analysis (BPCA). It also compares BPCA with k-nearest neighbor (KNN) imputation and measures the error rate of each. The evaluation is carried out on a satellite dataset used to predict total electron content in the ionosphere. The experimental results show that BPCA estimates the missing data well and gives good short-term performance on the dataset used.

Index terms: Imputation, Total Electron Content, BPCA, K-nearest neighbor, NRMSE.

1. INTRODUCTION

The Total Electron Content (TEC) indicates the degree of ionisation of the ionosphere [12]. It is a quantity of concern for predicting space weather effects on telecommunications and for improving the accuracy of satellite navigation, flight control of vehicles and other systems that use trans-ionospheric signals, because the ionospheric layer affects these signals [13]. Figure 1 represents TEC and a typical electron density profile. TEC is the total number of electrons along a particular line of sight between the receiver and a GPS satellite, in a column of 1 m² cross-sectional area, and it is a typical quantitative parameter of interest to GPS users [2]. TEC is therefore the integral of the electron density profile from the ground to an infinite height.
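The definition of TEC as an integral of the electron density profile can be illustrated numerically. The sketch below is not taken from the paper: it assumes an alpha-Chapman layer as a stand-in density profile, with hypothetical peak density, peak height and scale height, integrates vertically rather than along a slanted line of sight, and truncates the "infinite height" at 2000 km. The names chapman_profile and tec_from_profile are illustrative.

```python
import numpy as np

TECU = 1.0e16  # one TEC unit is 1e16 electrons per square metre


def chapman_profile(height_km, n_max=1.0e12, h_max_km=300.0, scale_km=50.0):
    """Illustrative alpha-Chapman electron density in electrons/m^3.

    n_max, h_max_km and scale_km are assumed values, not from the paper.
    """
    z = (np.asarray(height_km, dtype=float) - h_max_km) / scale_km
    return n_max * np.exp(0.5 * (1.0 - z - np.exp(-z)))


def tec_from_profile(height_km, density_m3):
    """Integrate density over height with the trapezoidal rule; result in TECU."""
    height_m = np.asarray(height_km, dtype=float) * 1.0e3
    density = np.asarray(density_m3, dtype=float)
    column = np.sum(0.5 * (density[1:] + density[:-1]) * np.diff(height_m))
    return column / TECU


# Vertical column from the ground up to 2000 km in 1 km steps.
heights = np.arange(0.0, 2001.0, 1.0)
vertical_tec = tec_from_profile(heights, chapman_profile(heights))
```

With these assumed parameters the integral comes out at a few tens of TECU. A measured profile could be passed to tec_from_profile in place of the Chapman layer.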
Prediction of Total Electron Content consists of several steps, namely preprocessing, imputation of missing data, selection of a predictive model, parameter estimation, comparison of predictive models, and result analysis [8]. According to the literature [3], KNN has been used extensively for imputation but shows disadvantages, which are addressed here by using BPCA.

Figure 1. TEC representation and typical electron density profile

This research work concentrates on imputation, which is implemented with two models, BPCA and KNN. The experimental results indicate that BPCA approximates the missing values more closely than KNN. The paper is organized as follows. Section 2 deals with the analysis of missing values and imputation. Section 3 explains KNN and its disadvantages. Section 4 explains BPCA and its advantages in imputation. Section 5 describes the dataset and the experimental results, and Section 6 concludes with the findings and future extensions.

2. MISSING VALUES ANALYSIS AND IMPUTATION

Missing values are an issue in a substantial number of statistical analyses [1]. For example, in surveys, missing values rarely occur completely at random [5]. When nonrespondents differ systematically from respondents, adjusting for nonresponse may be required [6]. There are three types of missingness: Missing Completely At Random (MCAR), Missing At Random (MAR), and Non-Ignorable missingness. This research work focuses on MAR, in which missingness does not depend on the true value of the missing variable, but may depend on the values of other variables that are observed [11].
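The MAR assumption can be made concrete with a small simulation. The sketch below is illustrative only and is not part of the paper's experiments: the function make_mar_mask, the two missingness rates and the synthetic variables driver and target are all assumptions. The probability that target is missing depends only on the fully observed driver, never on the value of target itself.

```python
import numpy as np


def make_mar_mask(observed, low_rate=0.05, high_rate=0.5, seed=0):
    """Return a boolean mask (True = missing) whose rate depends on `observed`.

    Entries paired with an above-median observed value are dropped with
    probability high_rate, the others with probability low_rate.
    """
    rng = np.random.default_rng(seed)
    observed = np.asarray(observed, dtype=float)
    rate = np.where(observed > np.median(observed), high_rate, low_rate)
    return rng.random(observed.shape) < rate


# Synthetic example: `driver` is always observed, `target` gets the gaps.
demo_rng = np.random.default_rng(1)
driver = demo_rng.normal(size=1000)
target = demo_rng.normal(size=1000)
mask = make_mar_mask(driver)
target_with_gaps = np.where(mask, np.nan, target)
```

Because the mask is computed from driver alone, the gaps in target_with_gaps are MAR by construction; making the rate depend on target instead would give non-ignorable missingness.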

The missing values are not randomly distributed across all observations; rather, they are randomly distributed within one or more subsamples [4]. Imputation involves replacing an incomplete observation with complete information based on an estimate of the true value of the unobserved variable. Single imputation is used for the experiments carried out with the dataset [3]. Missing value imputation methods are usually compared in terms of RMSE, not in terms of their effect on high-level analysis.

3. K-NEAREST NEIGHBORS BASED IMPUTATION

KNN-based imputation is a standard missing value imputation method that takes advantage of the correlation structure in the dataset [7]. The imputation process is typically divided into two steps:

1) Divide the data set D into two parts. Let Dm be the set containing the instances in which at least one of the features is missing. The remaining instances, which have complete feature information, form the set Dc.
2) For each vector x in Dm:
   a) Divide the instance vector into observed and missing parts as x = [xo; xm].
   b) Calculate the distance between xo and all the instance vectors from the set Dc, using only those features of the instance vectors in Dc that are observed in the vector x.

In principle, KNN imputation works much better than other traditional methods (such as row average and median) but it requires enough complete patterns (patterns with no missing values) in the data set to be confident of finding the correct neighbors of the patterns with missing values [10]. The neighbor search is very time consuming, which can be critical in data mining where large databases are analyzed. In addition, a poor choice of k, the number of neighbors, can degrade the performance of the classifier after imputation because a few dominant instances are overemphasized in the estimation of the missing values [3].

4. BPCA FOR IMPUTATION

BPCA is a transform-based method that uses Bayesian probability theory to impute the missing values [2]. The dataset is represented as a matrix Y. BPCA divides the data set into two sets (complete and incomplete) and estimates the missing values from the observed values. The principal components are estimated using Bayes' theorem: Bayesian estimation calculates the posterior distribution of the model parameter θ and the input matrix X containing the samples using formula (1):

$$p(\theta, X \mid Y) \propto p(X, Y \mid \theta)\, p(\theta) \qquad (1)$$

where p(θ) is the prior distribution, which denotes the a priori preference for θ and X. BPCA takes advantage of the global correlation in the data set, and thus has the advantage of prediction speed, at the cost of some computational complexity [9].

5. DATA SETS AND EXPERIMENTAL RESULTS

The results of processing each of the data sets containing missing data with each of the above methods are summarized below. As would be expected, the control in which the actual data was left in had the best accuracy. BPCA imputation performed better than the other method applied, KNN. A matrix is chosen by considering both rows and columns. The experiment used linearly classifiable data as the harder data set, which makes differences in performance visible. The number of positive and negative training examples and test cases was controlled explicitly, and the number of attributes was explicitly stated for each set. Performance is evaluated based on error rate, percentage error and time taken for execution.

Table 1: Metrics taken for evaluation

S.N  Quantitative Metric          Abbreviation  Description
1    Mean Absolute Error          MAE           Average Distance from the Actual Value
2    Mean Absolute Percent Error  MAPE          Average Relative Distance from the Actual Value
3    Root Mean Squared Error      RMSE          Weighted Distance from the Actual Value
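For reference, the KNN baseline compared in this section follows the procedure of Section 3, which can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the source lists the steps only up to the distance computation, so the Euclidean distance and the final step (filling each missing entry with the mean of the k nearest complete rows) are assumptions, as are the function name knn_impute and the default k.

```python
import numpy as np


def knn_impute(data, k=3):
    """Fill NaN entries using the k nearest complete rows (assumed mean fill)."""
    data = np.asarray(data, dtype=float)
    missing = np.isnan(data)
    incomplete_rows = missing.any(axis=1)

    # Step 1: split D into the complete set Dc and the incomplete set Dm.
    complete = data[~incomplete_rows]
    if len(complete) == 0:
        raise ValueError("KNN imputation needs at least one complete row")
    k = min(k, len(complete))

    filled = data.copy()
    for i in np.flatnonzero(incomplete_rows):  # Step 2: each x in Dm
        observed = ~missing[i]                 # 2a: x = [xo; xm]
        # 2b: distance from xo to every row of Dc, on observed features only.
        diffs = complete[:, observed] - data[i, observed]
        dist = np.sqrt((diffs ** 2).sum(axis=1))
        nearest = np.argsort(dist)[:k]
        # Assumed final step: average the neighbours' values for xm.
        filled[i, ~observed] = complete[nearest][:, ~observed].mean(axis=0)
    return filled


example = np.array([
    [1.00, 2.00, 3.00],
    [1.10, 2.10, 3.10],
    [10.0, 20.0, 30.0],
    [1.05, 2.05, np.nan],
])
imputed = knn_impute(example, k=2)
```

The loop makes the cost noted in Section 3 visible: every incomplete row is compared against every complete row, so the run time grows with the product of the two set sizes.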

The first metric used to assess the accuracy of estimation is the NRMSE, shown in eqn (2):

$$\mathrm{NRMSE} = \frac{\sqrt{\operatorname{mean}\left[(y_{guess} - y_{ans})^{2}\right]}}{\operatorname{std}\left[y_{ans}\right]} \qquad (2)$$

where y_guess is the estimated value, y_ans is the actual value, and std[y_ans] is the standard deviation of the actual values.

The second metric is the mean absolute percentage error (MAPE). It is a measure of the accuracy of fitted time series values in statistics, specifically in trend estimation. It usually expresses accuracy as a percentage, and is defined by eqn (3):

$$M = \frac{1}{n} \sum_{t=1}^{n} \left| \frac{A_t - F_t}{A_t} \right| \qquad (3)$$

where A_t is the actual value and F_t is the estimated value. The difference between A_t and F_t is divided by the actual value A_t. The absolute value of this ratio is summed over every fitted or predicted point in time and divided by the number of fitted points n. This makes it a percentage error, so one can compare the errors of fitted time series that differ in level.

The third metric is the MAE, given in eqn (4):

$$E_i = \frac{1}{n} \sum_{j=1}^{n} \left| P_{(ij)} - T_j \right| \qquad (4)$$

where P_(ij) is the predicted value for the dataset and T_j is the target value. The MAE measures the average magnitude of the errors in a set of estimates, without considering their direction. It measures accuracy for continuous variables: it is the average, over the verification sample, of the absolute differences between each forecast and the corresponding observation.

Figure 3: Time taken for execution (time in ms; BPCA and KNN on dataset1 and dataset2)

The results in Figures 2 and 3 give an objective evaluation of the outcomes: both the error values and the time taken are lower for BPCA than for KNN. Figures 4 to 7 show the experimental observations graphically for the datasets used. Figure 4 depicts the original data and Figure 5 the data matrix with missing values. Figures 6 and 7 show the data imputed using BPCA and KNN, respectively. BPCA is therefore the more efficient method, with a lower error rate and less computational time.
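The three error metrics defined in this section can be sketched in a few lines. The function names are illustrative, and two details are assumptions rather than statements from the paper: the standard deviation in the NRMSE is taken as the population standard deviation (NumPy's default), and the metrics are evaluated only over the entries that were imputed.

```python
import numpy as np


def nrmse(y_guess, y_ans):
    """Root mean squared error divided by the standard deviation of the truth."""
    y_guess = np.asarray(y_guess, dtype=float)
    y_ans = np.asarray(y_ans, dtype=float)
    return np.sqrt(np.mean((y_guess - y_ans) ** 2)) / np.std(y_ans)


def mape(actual, estimate):
    """Mean absolute percentage error as a fraction (multiply by 100 for %)."""
    actual = np.asarray(actual, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    return np.mean(np.abs((actual - estimate) / actual))


def mae(predicted, target):
    """Mean absolute error between predictions and targets."""
    predicted = np.asarray(predicted, dtype=float)
    target = np.asarray(target, dtype=float)
    return np.mean(np.abs(predicted - target))


# Hypothetical true and imputed values for four missing entries.
true_values = np.array([10.0, 20.0, 30.0, 40.0])
imputed_values = np.array([12.0, 18.0, 33.0, 36.0])
scores = {
    "NRMSE": nrmse(imputed_values, true_values),
    "MAPE": mape(true_values, imputed_values),
    "MAE": mae(imputed_values, true_values),
}
```

Note that MAPE is undefined when an actual value is zero, so entries with a true value of zero would need to be excluded or handled separately.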

Figure 2: Error-related performance metrics for dataset1 (MAPE, MAE and NRMSE for BPCA and KNN; horizontal axis: error range)

Figure 4: Original satellite data

Figure 5: Data with missing values

Figure 6: Imputation based on BPCA

Figure 7: Imputation based on KNN

6. CONCLUSION

The two methods described were applied to ionosphere data sets and evaluated. The results obtained with BPCA showed a marked improvement in estimation performance. Each algorithm was tested on the data sets used here; however, more experiments are needed to validate the performance of each algorithm. In future, to make the comparison more reliable, the same data sets should be used to run all the experiments. The research work concludes that BPCA achieves better performance than KNN and can be considered a complementary algorithm when applied to larger datasets. This work can be developed further with optimization techniques such as bio-inspired computing that automatically choose the parameters of the function, which in turn reduces the computational complexity.

REFERENCES

[1]. Brevern, A., Hazout, S., Malpertuy, A.: Influence of microarrays experiments missing values on the stability of gene groups by hierarchical clustering. BMC Bioinformatics, Vol. 5. (2004)
[2]. Oba, S., Sato, M., Takemasa, I., Monden, M., Matsubara, K., Ishii, S.: A Bayesian missing value estimation method for gene expression profile data. Bioinformatics, Vol. 19, 2088-2096. (2003)
[3]. Dahl, F. A.: Convergence of random k-nearest-neighbour imputation. Computational Statistics & Data Analysis, Vol. 51, 5913-5917. (2007)
[4]. Bo, T., Dysvik, B., Jonassen, I.: LSimpute: accurate estimation of missing values in microarray data with least squares methods. Nucleic Acids Research, Vol. 32. (2004)
[5]. Ouyang, M., Welsh, W., Georgopoulos, P.: Gaussian mixture clustering and imputation of microarray data. Bioinformatics, Vol. 20, 917-923. (2005)
[6]. Hair, J., Black, W., Babin, B., Anderson, R., Tatham, R.: Multivariate data analysis. 6th edn. Pearson Education, Inc. (2006)
[7]. Batista, G. E. A. P. A., Monard, M. C.: K-Nearest Neighbour as Imputation Method: Experimental Results. Tech. Report 186, ICMC-USP. (2002)
[8]. Qin, Y.S., Zhang, S.C., Zhu, X.F., Zhang, J.L., Zhang, C.Q.: Semi-parametric Optimization for Missing Data Imputation. Applied Intelligence, 27(1), 79-88. (2007)
[9]. Truxillo, C.: Maximum Likelihood Parameter Estimation with Incomplete Data. SAS Users Group International Conference, Philadelphia PA, April 10-13. (2005)
[10]. "Microarray Missing Values Imputation Methods: Critical Analysis Review". UDC 004.423, DOI: 10.2298/csis0902165H
[11]. Little, R. J., Rubin, D. B.: Statistical Analysis with Missing Data. Second Edition. John Wiley and Sons, New York. (2002)
[12]. Acharya, R., Roy, B.: Kalman Filter Approach for Prediction of Ionospheric Total Electron Content. In: Proceedings of the International Conference on Computers and Devices for Communication. (2009)
[13]. Li, S., Peng, J.: Ionospheric TEC Prediction and Analysis Based on Phase Space Reconstruction. National High Technology Research and Development Program of China. (2010)