A Missing Data Treatment Method for Photovoltaic Installations

Ioannis P. Panapakidis 1,2, Aggelos S. Bouhouras 2, Georgios C. Christoforidis 2

1 Dept. of Electrical Engineering, Technological Educational Institute of Thessaly, Larisa, Greece
2 Dept. of Electrical Engineering, Western Macedonia University of Applied Sciences, Kozani, Greece

Abstract—Due to the high installation rate of Photovoltaic (PV) systems, the challenging task of data processing arises. As the number of intermittent PV systems grows, especially in distribution networks, this task becomes even more important. In many cases, the data present inconsistencies, i.e. missing, incomplete or clearly wrong values. The present paper proposes a method built on machine learning algorithms for missing data completion. The original incomplete PV generation time series are filled and restored with low error. The proposed method can be easily applied in real installations with a high rate of data collection and storage needs.

Keywords—Clustering; incomplete data; machine learning; photovoltaics

I. INTRODUCTION

Globally, the growth of PV installations and the respective continuous metering of data such as generated electricity, temperature, solar irradiation and others raise the need for intelligent algorithms for robust data processing [1]. The scope is to study the data, build descriptive models, retrieve exploitable information and, in general, draw useful conclusions from the processed data [2]. As the amount of data increases, atypical data become more likely. Atypical data may refer to outliers, vague values, erroneous values and others. The causes of atypical data vary: metering failures, noise interference, supply shortage, physical destruction of the equipment and others. Moreover, the absence of data is a crucial factor that hinders the use of data in specific engineering applications. For instance, a wind speed series that is incomplete due to absent data limits the assessment of the wind potential of a region and, therefore, a techno-economic feasibility study of a wind park may not be accurate and reliable. Also, missing data entries may lead to poor forecasting performance since the continuity of the time sequence is broken.

The phenomenon of absent data is known in the literature as “missing data”. The concept refers either to the complete absence of data among data entries and sequences or to incomplete data, i.e. partial presence of data. According to [3], if missing data account for less than 1% of the total sample, the effect is trivial. Rates between 1% and 5% are manageable, while for rates larger than 5% processing tools should be employed. The need to examine tools for missing data becomes more compelling since in many scientific fields “Big Data” is becoming a reality [4]-[5].

Therefore, the missing data completion problem can be an important aspect of data processing. Missing data are dealt with in three different ways: (a) discarding the series with missing data, (b) using maximum likelihood procedures based on the measured data and (c) imputing the missing data with estimated ones. Usually, in measured data sets, relationships among attributes exist for data in different time periods. The methods that have been applied for the completion of missing data mostly belong to the field of machine learning: K-Nearest Neighbor [3], Concept Most Common Attribute Value for Symbolic Attributes [6], K-means Clustering [7], Fuzzy K-means Clustering [8], Event Covering [9], Regularized Expectation-Maximization [10], Support Vector Machines [11], Singular Value Decomposition [12] and Bayesian Principal Component Analysis [13]. The work in [14] shows that the Event Covering method offers a very good synergy with Radial Basis Function Networks for missing data. According to [15], no imputation method performs best for all classifiers. In [16], time series of wave height data with 16.50% and 33% of missing values are completed using a method based on nonstationary modeling of long-term time series by means of simulated data from a population with the same probability law. The Data Interpolating Empirical Orthogonal Functions (DINEOF) algorithm is used in [17] to reconstruct missing data from satellite images, which is useful for filling missing data from geophysical fields.

Concerning missing data in PV time series, the work in [18] uses iterative multi-task learning for time series to fill in missing values in a PV system. In [19], two approaches are investigated, namely a regression tree ensemble tuned by Bayesian optimization and a simple rule that predicts the hourly averages observed in the previous year. Finally, in [20], the authors use a backfilling algorithm based on neural networks to synthesize lost data, which increases the performance ratio prediction accuracy.

In this context, the scope of this paper is to introduce a new methodology for missing data filling in PV systems. This methodology is applicable to both complete and partial absence of data and is not restricted by the size of the installed capacity. The data size and resolution are not limitations. The proposed methodology has the potential to be applied to every type of time series, e.g. temperature, solar irradiation and others. Thus, the methodology can be a part of the data processing and analytics stage in order to restore a discontinuous time series.
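The rule of thumb quoted from [3] in the Introduction can be expressed as a small check. The following is a minimal sketch and not part of the original work: the function name and the category labels are hypothetical, and the treatment of the exact boundary values (1% and 5%) is an assumption, since the text does not specify it.

```python
def missing_data_severity(n_missing: int, n_total: int) -> str:
    """Classify the share of missing samples using the thresholds quoted from [3]."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    rate = 100.0 * n_missing / n_total  # missing share in percent
    if rate < 1.0:
        return "trivial"
    if rate <= 5.0:  # assumption: the 5% boundary counts as manageable
        return "manageable"
    return "needs treatment"


# One year of 10 min samples: 365 days x 144 samples per day.
samples_per_year = 365 * 144
print(missing_data_severity(144, samples_per_year))       # one missing day
print(missing_data_severity(24 * 144, samples_per_year))  # 24 missing days
```

With these assumptions, a single missing day in a year of 10 min data falls in the trivial range, whereas 24 missing days (about 6.6% of the year) exceed the 5% threshold.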

©2018 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

II. MISSING DATA CONCEPT

A. Data description and general overview

The PV system under study is located in Northern Greece. The installed capacity is 9.98 kWp. It is connected to the low voltage distribution grid. The collected data cover the period 01/01/2013-31/12/2013 and refer to recorded PV generation values, measured on site. The metering interval is 10 min. The flowchart of the proposed methodology is presented in Fig. 1.

[Flowchart blocks: Start, Data collection, Normalization, decision branches (Yes/No), Test days set selection, Filling technique application, End]

Fig. 1. Flow-chart of the proposed methodology for missing data treatment.
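With a 10 min metering interval, each day holds 144 values, so one year of measurements can be arranged as 365 daily vectors before the normalization block of Fig. 1. The following is a minimal sketch of that preparation. The normalization scheme is not stated in the text; division by the largest value of the data set is an assumption made here for illustration, and the function names are hypothetical.

```python
from typing import List, Sequence

SAMPLES_PER_DAY = 24 * 60 // 10  # 10 min metering interval -> 144 values per day


def to_daily_patterns(series: Sequence[float],
                      per_day: int = SAMPLES_PER_DAY) -> List[List[float]]:
    """Cut a flat generation series into consecutive daily vectors of length per_day."""
    if len(series) % per_day != 0:
        raise ValueError("series length must be a whole number of days")
    return [list(series[i:i + per_day]) for i in range(0, len(series), per_day)]


def normalize(patterns: List[List[float]]) -> List[List[float]]:
    """Scale all values into [0, 1] by the largest value in the data set (assumed scheme)."""
    peak = max(max(day) for day in patterns)
    if peak <= 0:
        raise ValueError("data set contains no positive generation value")
    return [[value / peak for value in day] for day in patterns]


# Toy usage with 4 samples per day instead of 144.
toy = [0.0, 2.0, 4.0, 0.0, 0.0, 3.0, 8.0, 1.0]
days = normalize(to_daily_patterns(toy, per_day=4))
print(days)
```

Dividing by a single global maximum keeps the relative magnitude of the days, which matters here because the cluster profiles are reported to differ mostly in magnitude. A per-day scaling would be a different design choice.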

The methodology is composed of two general phases, namely the clustering phase and the completion phase. Clustering is an unsupervised machine learning tool with proven robustness in problems with no external information about the possible data structure [21]. Clustering algorithms have been successfully implemented in PV data analysis problems [22]. Usually, the outputs of a clustering algorithm are the centroids of the clusters and the cluster labels. The latter denote the membership of every data pattern in the clusters. In the proposed methodology, the important aspect of clustering is the labeling, i.e. the scope is to trace pattern sequence similarities.

B. Clustering phase

As depicted in Fig. 1, the data must be normalized after collection. This is due to the nature of clustering: the data are grouped together based on their similarities, so the magnitude of the physical units is not of practical importance. The PV data are expressed as D-dimensional vectors (patterns), where D is the number of elements that each vector contains. Every pattern represents a daily PV generation curve. In the case of missing data, the complete pattern is not available. In the case of incomplete data, the existing pattern has a dimension less than D. It should be noted that, in the latter case, the positions of the missing values can be random. Apart from normalization, a data pre-processing step may be applied in order to cleanse the data of outliers and erroneous values.

The algorithm selection step refers to the careful selection of the clustering algorithm. In many problems, a comparison of various algorithms takes place. In the present study, the selected algorithm is K-means, due to its reported effectiveness, reduced complexity and comprehensible operation [23]-[24]. The validity indicator is a mathematical criterion that both evaluates the algorithm's performance and provides information about the optimal number of clusters. The optimal clustering performance refers to the minimization of the validity indicator, i.e. the condition that corresponds to minimum clustering error. If this condition is not met, the user should repeat the execution of the algorithm with different parameters, such as the maximum number of iterations, the threshold value of the algorithm's objective function, the improvement between successive iterations and others. Apart from considering different algorithms, different validity indicators can be examined. In this work, K-means is checked with the ratio of Within Cluster Sum of Squares to Between Cluster Variation (WCBCR) [25]. WCBCR is a measure of both separation and compactness of the output clusters. Because it tends to receive lower values as the number of clusters grows, it is a suitable validity indicator for tracking the optimal number of clusters. The centroids of the clusters are the averages of the PV generation curves that belong to the same cluster.

C. Completion phase

The completion phase follows the clustering phase. The PV data set under study does not contain missing or incomplete data. For the purpose of testing the methodology, days with complete data are extracted from the data set to serve as test days. More specifically, 24 days (i.e. 2 days per month) are selected as test days in order to cover all months. Clustering is applied to the remaining days with no missing data, i.e. 341 days. A description of the method is presented below:

Step#1. Set the number of clusters to k. K-means is applied to the reduced data set of the 341 days. The clustering labels of the 341 days are obtained.

Step#2. Let n be the number of the day that is missing from the data set. The sequence S of the l previous days, i.e. from day n backwards, is extracted. This sequence is denoted as:

$$S_n^l = \{n_{i-l+1}, n_{i-l+2}, \ldots, n_{i-1}, n_i\} \tag{1}$$

Step#3. A correlation analysis is conducted using the Pearson correlation coefficient, in order to determine the correlation between the current time interval (i.e. t=10 min) and the previous days. The results are shown in Fig. 2.

Fig. 2. Correlation between current and previous time interval.

It can be observed that the PV value of the current 10 min interval is more correlated with the two previous ones. Therefore, the present daily generation curve is more correlated with the previous day and the day before the previous one.

Step#4. According to the correlation analysis of Step#3, we select l=2. Then, the whole data set is searched for sequences of cluster labels that are the same as the one preceding the test day.

Step#5. Let r be the number of sequences that are similar to that of test day n. Also, we denote as $n_r$ the days with the same sequence similarity. Next, we calculate the Euclidean distances between day n+1 and all the days $n_r+1$. Note that day n+1 is the day following the test day n and it is known. We keep the smallest Euclidean distance, i.e. $\min\{d_{Eucl}(n+1, n_r+1)\}$. Then we use the day $n_r$ that corresponds to $\min\{d_{Eucl}(n+1, n_r+1)\}$ in order to fill the missing day n. To clarify this step, we present an illustrative example in Fig. 3. In this example, the test day is denoted as n. The two previous days belong to the 5th and 4th cluster, respectively. Therefore, we search for the sequence {4, 5} in the whole set. Suppose that two similar sequences are found, ending before days $n_1$ and $n_2$. The next step is the calculation of the Euclidean distance between days n+1 and $n_1+1$, and between days n+1 and $n_2+1$. Let the smaller distance correspond to $n_2+1$. Next, we use the data of $n_2$ to fill the test day n.

Step#6. Calculate the Mean Absolute Range Normalized Error (MARNE) between days n and $n_r$ [23]:

$$\mathrm{MARNE} = \frac{1}{M}\sum_{m=1}^{M}\frac{\left|p_m^a - p_m^f\right|}{\max(p_m^a)} \times 100 \tag{2}$$

where $p_m^a$ and $p_m^f$ are the m-th elements of the actual and filled daily PV generation curves, respectively, and M is the number of elements in each curve.

Step#7. If MARNE is acceptable, terminate the process. Otherwise, increase the number of clusters to k+1 and repeat Step#1 to Step#7.

Note that the proposed methodology uses actual days of the data set to complete the missing ones instead of using the cluster profiles for this purpose. Since the profiles are the averages of the PV generation curves of the same clusters, filling a missing day with an average value would eventually lead to lower estimates of the generation levels of the missing day.

III. SIMULATION RESULTS

For the present data set, no prior information about the number of clusters is available. Thus, the clustering problem is a purely unsupervised machine learning task and the clustering algorithm is completely data driven. Following this concept, a series of experiments should take place to define the optimal number of clusters. We have selected the number of clusters to vary from 2 to 30. Apart from the number of clusters, the parameters that need to be determined for K-means are the maximum number of iterations and the minimum amount of improvement of the objective function between two successive iterations. The maximum number of iterations is set to 500 and the minimum improvement value to $10^{-6}$. The above values are selected based on experimentation. A low number of iterations corresponds to lower execution times but to poor cluster quality. The conducted experiments on the present data showed that 500 iterations lead to relatively low execution time and clusterings of high quality. The WCBCR indicator evaluates the K-means performance. The results are illustrated in Fig. 4. As the number of clusters increases, the WCBCR receives lower values. Employing the “knee” point detection method, we find that the optimal number is 8 [24]. A large number of clusters is not desirable since it adds complexity to the exploitation phase of the clustering outcome. Additionally, a low number of clusters corresponds to high clustering error, i.e. clusters with a low degree of homogeneity of the cluster members. The PV generation profiles of the 8 clusters are shown in Fig. 5. It can be observed that the profiles differ mostly in terms of magnitude and not in terms of general shape.
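The parameter sweep described above can be sketched with a plain K-means loop that stops after 500 iterations or when the objective improves by less than $10^{-6}$. This is an illustrative sketch only, not the authors' implementation. The WCBCR formula used here (sum of squared pattern-to-centroid distances divided by the sum of squared distances between all centroid pairs) is an assumption about the definition in [25], and the random initialization and seed are hypothetical choices.

```python
import random
from typing import List, Tuple

Vector = List[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(data: List[Vector], k: int, max_iter: int = 500,
           tol: float = 1e-6, seed: int = 0) -> Tuple[List[Vector], List[int]]:
    """Lloyd iterations; stop at max_iter or when the objective improves by less than tol."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(data, k)]  # k distinct patterns as seeds
    labels = [0] * len(data)
    prev_obj = float("inf")
    for _ in range(max_iter):
        # Assignment step: nearest centroid for every pattern.
        labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in data]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(data, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
        obj = sum(sq_dist(p, centroids[lab]) for p, lab in zip(data, labels))
        if prev_obj - obj < tol:
            break
        prev_obj = obj
    return centroids, labels


def wcbcr(data: List[Vector], centroids: List[Vector], labels: List[int]) -> float:
    """Within-cluster sum of squares over between-centroid variation (assumed form)."""
    within = sum(sq_dist(p, centroids[lab]) for p, lab in zip(data, labels))
    between = sum(sq_dist(centroids[i], centroids[j])
                  for i in range(len(centroids)) for j in range(i))
    return within / between if between > 0 else float("inf")


# Toy sweep over the number of clusters (the paper sweeps 2 to 30 on 341 days).
toy = [[0.0, 0.1], [0.1, 0.0], [1.0, 1.1], [1.1, 1.0], [2.0, 2.1], [2.1, 2.0]]
for k in range(2, 5):
    cents, labs = kmeans(toy, k)
    print(k, round(wcbcr(toy, cents, labs), 4))
```

In practice one would plot the indicator against k, as in Fig. 4, and pick the knee of the curve rather than the minimum, since the indicator keeps decreasing as clusters are added.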

Fig. 3. Example of cluster label sequence.
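Step#4 and Step#5 of the completion phase, illustrated in Fig. 3, can be sketched as a search over the cluster labels followed by a distance comparison on the succeeding days. The code below is a minimal sketch under stated assumptions: days are indexed consecutively, the missing day carries the label `None`, and the function name `select_donor_day` is hypothetical. The toy curves have two elements instead of 144.

```python
import math
from typing import List, Optional, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length curves."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_donor_day(curves: Sequence[Optional[Sequence[float]]],
                     labels: Sequence[Optional[int]],
                     n: int, l: int = 2) -> Optional[int]:
    """Return the index of the day used to fill missing day n, or None if no match.

    curves[i] is the daily curve of day i (curves[n] is missing and never read);
    labels[i] is the cluster label of day i (labels[n] is None).
    """
    if n < l or n + 1 >= len(curves):
        raise ValueError("day n needs l previous days and a known next day")
    target = labels[n - l:n]            # label sequence of the l days before day n
    best, best_dist = None, math.inf
    for c in range(l, len(curves) - 1):
        if n in (c, c + 1):             # candidate or its next day is the missing day
            continue
        if labels[c - l:c] != target:   # preceding label sequence must match
            continue
        d = euclidean(curves[n + 1], curves[c + 1])
        if d < best_dist:               # keep the candidate whose next day is closest
            best, best_dist = c, d
    return best


# Toy data echoing Fig. 3: the two days before missing day 9 carry labels 4 and 5.
labels = [4, 5, 1, 2, 4, 5, 3, 4, 5, None, 2]
curves = [[0.2, 0.3], [0.6, 0.7], [0.3, 0.3], [0.1, 0.2], [0.2, 0.2], [0.6, 0.6],
          [0.8, 0.9], [0.5, 0.7], [0.6, 0.8], None, [0.5, 0.6]]
donor = select_donor_day(curves, labels, n=9)
print(donor, curves[donor])
```

In this toy set the sequence {4, 5} precedes days 2 and 6. The day after day 6 is closer to the day after the missing day, so day 6 is selected and its curve is copied into day 9.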

Fig. 4. WCBCR scores for different number of clusters.

Fig. 5. PV generation profiles.

The missing data completion phase is carried out with 8 clusters, i.e. 8 cluster labels are used. Recall that the pattern sequence length is 2. If this length is increased, the probability of finding similar label sequences decreases. Table I presents the MARNE per selected test day, together with the day selected for completion. It can be observed that the MARNE indicator ranges from 0.5344% to 8.7754%, a fact that indicates the robustness of the proposed methodology. Recall that the day selected for completion is the candidate whose succeeding day lies at the minimum Euclidean distance from day n+1. The average MARNE is 2.9813%. There is no clear correlation between MARNE values and the type of days. The lowest value is met at 03/01/2013 and the second lowest (MARNE=0.8314%) at 15/11/2013. However, higher values are met on April days. Also, the days selected for completion do not always belong to the same season as the test day. For instance, 01/10/2013 is filled with 13/03/2013. This fact denotes that the clustering should include the whole available set and should not be restricted to seasons, i.e. by employing a different clustering per season.

Figs. 6-9 present some examples of missing data filling. The test day is illustrated together with the day selected for completion. On 02/01/2013, high deviations between the two days are met at generation peak hours. This is not the case for 01/05/2013 and 01/07/2013. In these cases, the selected days largely succeed in capturing the shape of the missing ones.

The proposed methodology can also be used for days with incomplete data. For this set of experiments, the same test days are used. Incomplete data completion refers to the filling of days with sporadic measurements. For this, we remove 50% of the data of every test day. The removed elements are selected randomly and the same elements are removed from every day. Fig. 10 and Fig. 11 present two examples of incomplete days. Fig. 12 and Fig. 13 show the actual data of the test days and the days used for completion.

TABLE I. TEST DAY AND SELECTED DAY FOR COMPLETION.

Test day      Selected day    MARNE (%)
03/01/2013    20/12/2013      0.5344
15/01/2013    24/11/2013      4.4463
01/02/2013    09/02/2013      3.2244
15/02/2013    24/10/2013      2.6372
01/03/2013    20/02/2013      2.0761
15/03/2013    11/02/2013      3.7160
01/04/2013    28/08/2013      8.7754
15/04/2013    05/09/2013      5.8752
01/05/2013    05/08/2013      1.7826
15/05/2013    25/06/2013      2.3057
01/06/2013    27/05/2013      1.5332
15/06/2013    12/07/2013      3.8165
01/07/2013    27/05/2013      2.0380
15/07/2013    13/07/2013      1.2092
01/08/2013    11/08/2013      4.0025
15/08/2013    06/09/2013      3.0737
01/09/2013    06/09/2013      1.6226
15/09/2013    20/09/2013      1.7947
01/10/2013    13/03/2013      4.1048
15/10/2013    10/10/2013      2.7728
01/11/2013    05/01/2013      1.6440
15/11/2013    21/02/2013      0.8314
01/12/2013    24/12/2013      2.0790
15/12/2013    14/12/2013      5.6563
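The error figures of Table I can be reproduced in form with a short function. The sketch below reads Eq. (2) as the mean absolute deviation between the actual and the filled curve, divided by the peak of the actual curve and expressed in percent. That reading is an assumption, and the function name `marne` is hypothetical. The second part recomputes the average of the MARNE column of Table I as an arithmetic check.

```python
from typing import Sequence


def marne(actual: Sequence[float], filled: Sequence[float]) -> float:
    """Mean absolute error, normalized by the peak of the actual curve, in percent."""
    if len(actual) != len(filled) or not actual:
        raise ValueError("curves must be non-empty and of equal length")
    peak = max(actual)
    if peak <= 0:
        raise ValueError("actual curve has no positive value")
    return 100.0 * sum(abs(a - f) for a, f in zip(actual, filled)) / (len(actual) * peak)


# Toy curves with 4 elements instead of 144.
print(marne([0.0, 2.0, 4.0, 2.0], [0.0, 1.0, 4.0, 3.0]))  # 12.5

# MARNE values (%) as listed in Table I, in row order.
TABLE_I = [0.5344, 4.4463, 3.2244, 2.6372, 2.0761, 3.7160, 8.7754, 5.8752,
           1.7826, 2.3057, 1.5332, 3.8165, 2.0380, 1.2092, 4.0025, 3.0737,
           1.6226, 1.7947, 4.1048, 2.7728, 1.6440, 0.8314, 2.0790, 5.6563]
print(round(sum(TABLE_I) / len(TABLE_I), 4))  # 2.9813
```

The 24 values sum to 71.5520, so their mean is 2.9813%, in agreement with the average reported in the text.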

Fig. 6. Real daily curve of 02/01/2013 and the day selected for its completion.

Fig. 7. Real daily curve of 01/05/2013 and the day selected for its completion.

Fig. 8. Real daily curve of 01/07/2013 and the day selected for its completion.

Fig. 9. Real daily curve of 01/11/2013 and the day selected for its completion.

Fig. 10. Example of incomplete test day (15/07/2013).

Fig. 11. Example of incomplete test day (15/09/2013).

Fig. 12. Example of incomplete test day and the selected day for completion (15/07/2013).

Fig. 13. Example of incomplete test day and the selected day for completion (15/09/2013).
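The construction of the incomplete test days shown in Figs. 10-13 (50% of the values removed, at the same randomly selected positions for every day) can be sketched as follows. The seed, the function names and the use of `None` as the marker of a removed value are assumptions made for illustration. How the remaining values enter the completion step is not shown here.

```python
import random
from typing import List, Optional, Sequence


def make_mask(length: int, fraction: float = 0.5, seed: int = 0) -> List[int]:
    """Pick the positions to delete once, so the same positions are removed from every day."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(length), int(length * fraction)))


def remove_values(day: Sequence[float], removed: Sequence[int]) -> List[Optional[float]]:
    """Return a copy of day with the masked positions replaced by None."""
    gone = set(removed)
    return [None if i in gone else v for i, v in enumerate(day)]


mask = make_mask(144)                       # 72 of the 144 ten-minute positions
day_a = [float(i) for i in range(144)]      # placeholder curves, not measured data
day_b = [float(2 * i) for i in range(144)]
incomplete_a = remove_values(day_a, mask)
incomplete_b = remove_values(day_b, mask)
print(sum(v is None for v in incomplete_a))  # 72
```

Drawing the mask once and reusing it keeps the experiment comparable across test days, since every day loses exactly the same time intervals.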

IV. CONCLUDING REMARKS

PV technology has constantly increased its share in electricity generation in many countries. It can be used to cover a variety of loads with special characteristics, such as loads without grid connection, loads located in isolated areas and others. The energy policies of many countries seek ways to increase the PV share in the electricity generation mix. Also, PVs are a reliable way to transform the consumer into a prosumer, a concept that allows consumers to take part in deregulated market auctions and increase their benefits. The majority of PV installations include metering equipment that measures critical quantities of the system's operation. The collection and processing of data is crucial in the evaluation of the economic performance of a PV installation, and can also be used by utilities to analyze the impact on the grid. Thus, the validity of the collected data is important. However, in many cases the data may contain a high amount of atypical values, such as missing and incomplete entries. This fact can obstruct the exploitation of the data. If the incomplete data are disregarded from the set, the overall amount of data decreases, with a possible loss of data credibility. Instead of removing them from the set, the incomplete data can be artificially completed with data entries of high similarity. The present paper proposes a novel methodology for missing and incomplete data completion. This methodology uses clustering in order to group patterns of the available data into homogeneous clusters. For the purpose of a demonstration of the methodology, the K-means algorithm is employed. Different algorithms can be applied as well. The methodology is not dependent on data size, data resolution or the amount of missing and incomplete data. It can be used for virtually any type of time series. For demonstration reasons, only PV data were investigated. The results presented in the paper indicate the robustness of the methodology.
The mean MARNE of the considered set of test days is close to 3%.

REFERENCES

[1] S. R. Madeti and S.N. Singh, “Monitoring system for photovoltaic plants: A review”, Renew. Sust. Energy Rev., vol. 67, pp. 1180-1207, Jan. 2017.
[2] C.D. Manning and P. Raghavan, Introduction to Information Retrieval. Cambridge University Press, 2008.
[3] P.A. Batista and M.C. Monard, “An analysis of four missing data treatment methods for supervised learning”, Appl. Artif. Intel., vol. 17, pp. 519-533, 2003.
[4] A. Fahad, N. Alshatri, Z. Tari, A. Alamr, I. Khalil, A.Y. Zomaya, S. Foufou and A. Boura, “A survey of clustering algorithms for Big Data: Taxonomy and empirical analysis”, Em. Top. Comput., vol. 2, pp. 267-279, 2014.
[5] R. Addo-Tenkorang and P.T. Helo, “Big data applications in operations/supply-chain management: A literature review”, Comput. Ins. Eng., vol. 101, pp. 528-543, November 2016.
[6] J.W. Grzymala-Busse and L.K. Goodwin, “Handling missing attribute values in preterm birth data sets”, in D. Slezak, J. Yao, J.F. Peters, W. Ziarko and X. Hu (Eds.), Lecture notes in computer science: Vol. 3642. Rough sets, fuzzy sets, data mining, and granular computing (RSFDGrC 2005), pp. 342-351. Canada: Springer, 2005.
[7] D. Li, J. Deogun, W. Spaulding and B. Shuart, “Towards missing data imputation: A study of fuzzy K-means clustering method”, in S. Tsumoto, R. Slowinski, J. Komorowski and J.W. Grzymala-Busse (Eds.), Lecture notes in computer science: Vol. 3066. Rough sets and current trends in computing (RSCTC 2004), pp. 573-579. Springer-Verlag, 2004.
[8] E. Acuna and C. Rodriguez, “The treatment of missing values and its effect in the classifier accuracy”, in D. Banks, L. House, F.R. McMorris, P. Arabie and W. Gaul (Eds.), Classification, clustering and data mining applications, pp. 639-648. Springer-Verlag Berlin-Heidelberg, 2004.
[9] A.K.C. Wong and D.K.Y. Chiu, “Synthesizing statistical knowledge from incomplete mixed-mode data”, IEEE Trans. Patt. Analys. Mach. Intel., vol. 9, pp. 796-805, November 1987.
[10] T. Schneider, “Analysis of incomplete climate data: Estimation of mean values and covariance matrices and imputation of missing values”, J. Clim., vol. 14, pp. 853-871, March 2001.
[11] H.A.B. Feng, G.C. Chen, C.D. Yin, B.B. Yang and Y.E. Chen, “A SVM regression based approach to filling in missing values”, in R. Khosla, R.J. Howlett and L.C. Jain (Eds.), Lecture notes in artificial intelligence: Vol. 3683. Knowledge-based intelligent information and engineering systems (KES 2005), pp. 581-587. Springer, 2005.
[12] O. Troyanskaya, M. Cantor, G. Sherlock, P. Brown, T. Hastie, R. Tibshirani, D. Botstein and R.B. Altman, “Missing value estimation methods for DNA microarrays”, Bioinf., vol. 17, pp. 520-525, June 2001.
[13] S. Oba, M. Sato, I. Takemasa, M. Monden, K. Matsubara and S.A. Ishii, “Bayesian missing value estimation method for gene expression profile data”, Bioinf., vol. 19, pp. 2088-2096, November 2003.
[14] J. Luengo, S. García and F.A. Herrera, “A study on the use of imputation methods for experimentation with Radial Basis Function Network classifiers handling missing attribute values: The good synergy between RBFNs and Event Covering method”, Neural Net., vol. 23, pp. 406-418, 2010.
[15] J. Luengo, S. García and F. Herrera, “On the choice of the best imputation methods for missing values considering three groups of classification methods”, Knowl. Inf. Syst., vol. 32, pp. 77-108, 2012.
[16] S.N. Stefanakos and G.A. Athanassoulis, “A unified methodology for the analysis, completion and simulation of nonstationary time series with missing values, with application to wave data”, Appl. Ocean Res., vol. 23, pp. 207-220, August 2001.
[17] A. Nikolaidis, G.C. Georgiou, D. Hadjimitsis and E. Akylas, “Filling in missing sea-surface temperature satellite data over the eastern Mediterranean Sea using the DINEOF Algorithm”, Cent. Eur. J. Geosc., vol. 6, pp. 27-41, March 2014.
[18] T. Shireen, C. Shao, H. Wanga, J. Li, X. Zhang and M. Li, “Iterative multitask learning for time-series modeling of solar panel PV outputs”, Applied Energy, vol. 212, pp. 654-662, February 2018.
[19] K. Bujna and M. Wistuba, “Multi-plant photovoltaic energy forecasting challenge with regression tree ensembles and hourly average forecasts”, CEUR Workshop Proceedings, vol. 1972, 2017.
[20] E. Koubli, D. Palmer, T. Betts, P. Rowley and R. Gottschalg, “Inference of missing PV monitoring data using neural networks”, 43rd IEEE Photovoltaic Specialists Conference, PVSC 2016, Portland, 2016.
[21] R. Xu and D. Wunsch, Clustering. New Jersey: John Wiley & Sons, Inc., 2006.
[22] G.C. Christoforidis, T.A. Papadopoulos, I.P. Panapakidis and G.K. Papagiannis, “PV power clustering as a means to evaluate energy storage”, International Conference on Renewable Energy Research and Applications (ICRERA2013), October 2013, Madrid, Spain, pp. 1-6.
[23] D. Steinley, “K-means clustering: A half-century synthesis”, British J. Math. Stat. Psychol., vol. 59, pp. 1-34, May 2006.
[24] M.E. Celebi, H.A. Kingravi and P.A. Vela, “A comparative study of efficient initialization methods for the k-means clustering algorithm”, Exp. Syst. Appl., vol. 40, 200-21, January 2013.
[25] I.P. Panapakidis, M.C. Alexiadis and G.K. Papagiannis, “Enhancing the clustering process in the category model load profiling”, Gen. Trans. Distr., vol. 9, pp. 655-665, April 2015.
[26] B. Soldo, P. Potocnik, G. Simunovi, T. Sari and E. Govekar, “Improving the residential natural gas consumption forecasting models by using solar radiation”, Energy Build., vol. 69, pp. 498-506, Feb. 2014.
[27] Q. Zhao, V. Hautamaki and P. Fränti, “Knee point detection in BIC for detecting the number of clusters”, International Conference on Advanced Concepts for Intelligent Vision Systems 2008, pp. 664-673.